The vertex-centric programming model is an established computational paradigm recently incorporated into 
distributed processing frameworks to address challenges in large-scale graph processing. Billion-node graphs 
that exceed the memory capacity of standard machines are not well-supported by popular Big Data tools like 
MapReduce, which are notoriously poor-performing for iterative graph algorithms such as PageRank. In response, 
a new type of framework challenges one to “think like a vertex” (TLAV) and implements user-defined 
programs from the perspective of a vertex rather than a graph. Such an approach improves locality, demonstrates 
linear scalability, and provides a natural way to express and compute many iterative graph algorithms. 
These frameworks are simple to program and widely applicable, but, like an operating system, are composed of 
several intricate, interdependent components, of which a thorough understanding is necessary in order to elicit 
top performance at scale. To this end, the first comprehensive survey of TLAV frameworks is presented. In this 
survey, the vertex-centric approach to graph processing is overviewed, TLAV frameworks are deconstructed into 
four main components and respectively analyzed, and TLAV implementations are reviewed and categorized. 


I. INTRODUCTION 

The proliferation of mobile devices, the ubiquity of the web, and a plethora of sensors have led to an
exponential increase in the amount of data created, stored, managed, and processed. In March 2014, an IBM
report claimed that 90% of the world’s data had been generated in the last two years flj65l . Big Data
characterizes the problems faced by conventional analytics systems with this dramatic expansion of data
volume, velocity, and variety.

To address the challenges posed by Big Data, analytical systems are shifting from shared, centralized
architectures to distributed, decentralized architectures. The MapReduce framework and its open-source
variant, Hadoop, exemplify this effort by introducing a programming model to facilitate efficient,
distributed algorithm execution while abstracting away lower-level details [32]. Since inception, the
Hadoop/MapReduce ecosystem has grown considerably in support of related Big Data tasks.

However, these distributed frameworks are not suited for all purposes, and in many cases can even
result in poor performance [31] [59] [85]. Algorithms that make use of multiple iterations, especially
those using graph or matrix data representations, are particularly poorly suited for popular Big Data
processing systems.

Graph computation is notoriously difficult to scale and parallelize, often due to inherent
interdependencies within graph data tm. As Big Data drives graph sizes beyond the memory capacity of a
single machine, data must be partitioned to out-of-memory storage or distributed memory. However, for
sequential graph algorithms, which require random access to all graph data, poor locality and the
indivisibility of the graph structure cause time- and resource-intensive pointer-chasing between storage
mediums in order to access each datum.

In response to these shortcomings, new frameworks based on the vertex-centric programming model have
been developed with the potential to transform the ways in which researchers and practitioners approach
and solve certain problems [80]. Vertex-centric computing frameworks are platforms that iteratively
execute a user-defined program over vertices of a graph. The user-defined vertex function typically
includes data from adjacent vertices or incoming edges as input, and the resultant output is
communicated along outgoing edges. Vertex program kernels are executed iteratively for a certain number
of rounds, or until a convergence property is met. As opposed to the randomly-accessible, “global”
perspective of the data employed by conventional shared-memory sequential graph algorithms,
vertex-centric frameworks employ a local, vertex-oriented perspective of computation, encouraging
practitioners to “think like a vertex” (TLAV).

The first published TLAV framework was Google’s Pregel system [80], which, based on Valiant’s Bulk
Synchronous Parallel (BSP) model [130], employs synchronous execution. While not all TLAV frameworks are
synchronous, these frameworks are first introduced here within the context of BSP in order to provide a
foundational understanding of TLAV concepts.

A. Bulk Synchronous Parallel 

After spending a year with Bill McColl at Oxford in 1988, Les Valiant published the seminal paper on
the Bulk Synchronous Parallel (BSP) computing model [130] for guiding the design and implementation of
parallel algorithms. Initially touted as “A Bridging Model for Parallel Computation,” the BSP model was
created to simplify the design of software for parallel hardware, thereby “bridging” the gap between
high-level programming languages and multi-processor systems.

As opposed to distributed shared memory or other distributed systems abstractions, BSP makes heavy use
of a message passing interface (MPI), which avoids high-latency reads, deadlocks, and race conditions.
BSP is, at the most basic level, a two-step process performed iteratively and synchronously: 1) perform
task computation on local data, and 2) communicate the results; the two steps are then repeated. In BSP,
each compute/communicate iteration is called a superstep, with synchronization of the parallel tasks
occurring at the superstep barriers, depicted in Figure 1.
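
The compute/communicate cycle described above can be illustrated with a minimal, single-process sketch. The function names `run_bsp` and `ring_step` are hypothetical and chosen only for illustration; real BSP systems run tasks in parallel on separate processors, whereas this sketch simulates the barrier by swapping message buffers between supersteps.

```python
from typing import Callable, Dict, List


def run_bsp(num_tasks: int,
            compute: Callable[[int, int, List[int]], Dict[int, int]],
            max_supersteps: int) -> List[List[int]]:
    """Simulate BSP supersteps; return each task's inbox after the last barrier."""
    inboxes: List[List[int]] = [[] for _ in range(num_tasks)]
    for superstep in range(max_supersteps):
        outboxes: List[List[int]] = [[] for _ in range(num_tasks)]
        # Step 1: each task computes on its local data (here, its inbox).
        # Step 2: each task communicates results as {destination: value}.
        for task in range(num_tasks):
            for dest, value in compute(task, superstep, inboxes[task]).items():
                outboxes[dest].append(value)
        # Barrier: messages sent in this superstep become visible
        # only at the beginning of the next superstep.
        inboxes = outboxes
    return inboxes


def ring_step(task: int, superstep: int, inbox: List[int]) -> Dict[int, int]:
    """Hypothetical task: add one to the received total and pass it to the next task."""
    return {(task + 1) % 3: sum(inbox) + 1}


if __name__ == "__main__":
    print(run_bsp(3, ring_step, 4))
```

Because no task reads a message before the barrier, the result does not depend on the order in which tasks are visited inside a superstep.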


B. Graph Parallel Systems 

Introduced in 2010, the Pregel system [80] is a BSP implementation that provides an API specifically
tailored for graph algorithms, challenging the programmer to “think like a vertex.” Graph algorithms are
developed in terms of what each vertex has to compute based on local vertex data, as well as data from
incident edges and adjacent vertices. The Pregel framework, as well as other synchronous TLAV
implementations, splits computation into BSP-style supersteps. Analogous to “components” in BSP [130],
at each superstep a vertex can execute the user-defined vertex function and then send results to
neighbors along graph edges. Supersteps always end with a synchronization barrier, shown in Figure 1,
which guarantees that messages sent in a given superstep are received at the beginning of the next
superstep. Unlike the original BSP model, vertices may change status between active and inactive,
depending on the overall state of execution. Pregel terminates when all vertices halt and no more
messages are exchanged.

A comparison of TLAV frameworks and BSP is presented in Figure 2. BSP employs a general model of broad
applicability, including graph algorithms at varying levels of granularity. Underlying BSP execution is
the global synchronization barrier among distributed processors. TLAV frameworks utilize a
vertex-centric programming model, and while Pregel and its derivatives employ BSP-founded synchronous
execution, other frameworks implement asynchronous execution, which has been demonstrated to improve
performance in some instances ED.

In contrast to TLAV and BSP, MapReduce does not natively support iterative algorithms. Several recent
frameworks have extended the MapReduce model to support iterative execution ll57il , but for iterative
graph algorithms, the graph topological data, which remains static, must be transferred from mappers to
reducers, resulting in significant network overhead that renders iterative MapReduce frameworks
uncompetitive with TLAV frameworks E). A theoretical comparison between MapReduce and BSP is presented
in [91].

C. TLAV Frameworks 

Since Pregel, several TLAV frameworks have been proposed that either employ conceptually alternative
framework components (such as asynchronous execution), or improve upon the Pregel model with various
optimizations. This survey provides the first comprehensive examination of TLAV framework concepts, and
makes these other contributions:

1. Analyzes four principal components in the design of vertex program execution in TLAV frameworks,
identifying the trade-offs in component implementations and providing data-driven discussion

2. Overviews approaches related to TLAV system architecture, including fault tolerance on distributed
systems and novel techniques for large-scale processing on single machines

3. Discusses how the scalability of a graph algorithm varies inversely with the algorithm’s scope,
illustrated by vertex-centric and related subgraph-centric, or hybrid, frameworks

This article is organized as follows: First, Section II overviews the vertex-centric programming model,
including an example program and execution. Section III presents the four major design decisions, or
pillars, of the vertex-centric model. Section IV presents details for distributed implementation, as
well as novel techniques utilized by TLAV frameworks that enable large-scale graph processing on a
single machine. Section V presents subgraph-centric, or hybrid, frameworks that adopt a computational
scope of the graph that is greater than a vertex (TLAV) but less than the entire graph. Section VI
discusses related work. Finally, Section VII presents a summary, conclusions, and directions for future
work.

FIG. 1: Example of Bulk Synchronous Parallel execution with 3 tasks/workers over 4 supersteps. Each task
may have varying durations after which messages are passed. The barriers control synchronization across
the entire system.

FIG. 2: Comparison of the Think Like a Vertex (TLAV) and Bulk Synchronous Parallel (BSP) models of
computation. Both models are commonly employed for iterative computation.


First, a brief note on terminology: The TLAV paradigm is described interchangeably as vertex-centric,
vertex-oriented, or think-like-a-vertex. A vertex program kernel refers to an instance of the
user-defined vertex program, function, or process that is executed on a particular vertex. A graph is a
data structure made up of vertices and edges, both with (potentially empty) data properties. As in the
literature, graph and network may be used interchangeably, as may node and vertex, and edge and link.
Network may also refer to hardware connecting two or more machines, depending on context. A worker
refers to a slave machine in the conventional master-worker architectural pattern, and a worker process
is the program that governs worker behavior, including, but not limited to, execution of vertex
programs, inter-machine communication, termination, check-pointing, etc. Graphs are assumed to be
directed without loss of generality.


II. OVERVIEW 

Graph processing is transitioning from centralized to decentralized design patterns. Sequential,
shared-memory graph algorithms are inherently centralized. Conventional graph algorithms, such as
Dijkstra’s shortest path [35] or betweenness centrality Eol, receive the entire graph as input and
presume all data is randomly accessible in memory (i.e., graph-omniscient algorithms), and a centralized
computational agent processes the graph in a sequential, top-down manner. However, the unprecedented
size of Big Data-produced graphs, which may contain hundreds of billions of nodes and occupy terabytes
of data or more, exceeds the memory capacity of standard machines. Moreover, attempting to centrally
compute graph algorithms across distributed memory results in unmanageable pointer-chasing [78]. A more
local, decentralized approach is required for processing graphs of scale.

Think like a vertex frameworks are platforms that iteratively execute a user-defined program over
vertices of a graph. The vertex program is designed from the perspective of a vertex, receiving as input
the vertex’s data as well as data from adjacent vertices and incident edges. The vertex program is
executed across vertices of the graph either synchronously or asynchronously. Execution halts either
after a specified number of iterations or once all vertices have converged. The vertex-centric
programming model is less expressive than conventional graph-omniscient algorithms, but is easily
scalable with more opportunity for parallelism.

The frameworks are founded in the field of distributed algorithms. Although vertex-centric algorithms
are local and bottom-up, they have a provable, global result. TLAV frameworks are heavily influenced by
distributed algorithms theory, including synchronicity and communication mechanisms ED. Several
distributed algorithm implementations, such as distributed Bellman-Ford single-source shortest path
[79], are used as benchmarks throughout the TLAV literature. The recent introduction of TLAV frameworks
has also spurred the adaptation of many popular Machine Learning and Data Mining (MLDM) algorithms into
graph representations for high-performance TLAV processing of large-scale data sets CD.

Many graph problems can be solved by both a sequential, shared-memory algorithm as well as a
distributed, vertex-centric algorithm. For example, the PageRank algorithm for calculating web-page
importance has a centralized matrix form [92] as well as a distributed, vertex-centric form [80]. The
existence of both forms illustrates that many problems can be solved in more than one way, by more than
one approach or computational perspective, and deciding which approach to use depends on the task at
hand. While the sequential, shared-memory approach is often more intuitive and easier to implement on a
single machine or centralized architecture, the limits of such an approach are being reached.

Vertex programs, in contrast, only depend on data local to a vertex, and reduce computational
complexity by increasing communication between program kernels. As a result, TLAV frameworks are highly
scalable and inherently parallel, with manageable inter-machine communication. For example, runtime on
the Pregel framework has been shown to scale linearly with the number of vertices on 300 machines [80].
Furthermore, TLAV frameworks provide a common interface for vertex-program execution, abstracting away
low-level details of distributed computation, like MPI, allowing for a fast, re-usable development
environment. TLAV frameworks thus represent a paradigm shift from centralized to decentralized
approaches to problem solving.


A. Example: Single Source Shortest Path in TLAV paradigm 

The following describes a simple vertex program that calculates the shortest paths from a given vertex
to all other vertices in a graph. In contrast to this distributed implementation example, consider a
centralized, sequential, shared-memory, or “graph-omniscient,” solution to the single-source shortest
path problem known as Dijkstra’s algorithm [35] or the more general Bellman-Ford algorithm D3I. Both
Dijkstra’s and the Bellman-Ford algorithms are based on repeated relaxations, which iteratively replace
distance estimates with more accurate values until eventually reaching the solution. Both variants have
a superlinear time complexity: Dijkstra’s runs in $O(|E| \log |E| + |V|)$ and Bellman-Ford’s runs in
$O(|E| \times |V|)$, where $|E|$ is the number of edges and $|V|$ is the number of vertices in the
graph, and typically $|E| \gg |V|$. Perhaps more importantly, both procedural, shared-memory algorithms
keep a large state matrix resulting in a space complexity of $O(|V|^2)$.
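
For reference, the repeated-relaxation idea behind the sequential algorithms can be sketched as follows. This is a minimal illustration of Bellman-Ford, not the formulation of any cited work; the function name `bellman_ford` and the small example graph are assumptions, and this particular sketch keeps only a distance array rather than a state matrix.

```python
import math
from typing import List, Tuple


def bellman_ford(num_vertices: int,
                 edges: List[Tuple[int, int, float]],
                 source: int) -> List[float]:
    """Sequential single-source shortest paths by repeated edge relaxation."""
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    # At most |V| - 1 passes over all |E| edges: O(|E| x |V|) time.
    for _ in range(num_vertices - 1):
        changed = False
        for u, v, weight in edges:
            # Relaxation: replace the estimate for v if going through u is shorter.
            if dist[u] + weight < dist[v]:
                dist[v] = dist[u] + weight
                changed = True
        if not changed:
            break  # every estimate is already final
    return dist


if __name__ == "__main__":
    # Hypothetical graph with 4 vertices and 6 directed, weighted edges.
    example_edges = [(0, 1, 2), (0, 2, 4), (1, 2, 1), (1, 3, 6), (2, 3, 3), (3, 0, 1)]
    print(bellman_ford(4, example_edges, 0))
```

Note that the algorithm reads and writes arbitrary entries of `dist` and scans the whole edge list, which is the global, randomly-accessible view of the graph that the vertex-centric model avoids.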

In contrast, to solve the same single-source shortest path problem in the TLAV programming model, a
vertex program need only pass the minimum value of its incoming edges to its outgoing edges during each
superstep. This algorithm, considered a distributed version of Bellman-Ford m, is shown in Alg. 1. The
computational complexity of each vertex program kernel is less than that of the sequential solution;
however, a new dimension is introduced in terms of the communication complexity, or the messaging
between vertices ED. For a TLAV implementation, a user need only write the inner portion of Alg. 1,
denoted by line numbers; the outermost loop and the parallel execution are handled by the framework.
Because lines 1-10 are executed on each vertex, these lines are known as the vertex program.

The TLAV solution to the single source shortest path problem has surprisingly few lines of code, and
understanding its execution requires a different way of thinking.

Figure 3 depicts the execution of Alg. 1 for a graph with 4 vertices and 6 weighted directed edges.
Only the source vertex begins in an active state. In each superstep, a vertex processes its incoming
messages, determines the smallest value among all messages received, and if the smallest received value
is less than the vertex’s current shortest path, then the vertex adopts the new value as its shortest
path, and sends the new path length plus respective edge weights to outgoing neighbors. If a vertex does
not receive any new messages, then the vertex becomes inactive, represented as a shaded vertex in
Figure 3. Overall execution halts once no more messages are sent and all vertices are inactive.
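
The execution just described can be simulated in a few lines. The following is a minimal single-process sketch, assuming a hypothetical adjacency-list graph and hypothetical helper names (`sssp_vertex_program`, `run_synchronous`); it is not the API of Pregel or of any other framework. Only the first function corresponds to what a framework user would write; the second plays the role of the framework.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[int, float]  # (destination vertex, edge weight)


def sssp_vertex_program(shortest: float, messages: List[float],
                        out_edges: List[Edge]) -> Tuple[float, List[Edge]]:
    """User-defined vertex program: returns (new value, [(destination, message)])."""
    min_incoming = min(messages)
    outgoing: List[Edge] = []
    if min_incoming < shortest:
        shortest = min_incoming
        # Send the new path length plus the edge weight along each outgoing edge.
        outgoing = [(dst, shortest + weight) for dst, weight in out_edges]
    return shortest, outgoing  # the vertex then halts until a message arrives


def run_synchronous(graph: Dict[int, List[Edge]], source: int):
    """Framework side: superstep loop, message delivery, and termination."""
    values = {v: math.inf for v in graph}
    inbox: Dict[int, List[float]] = {source: [0.0]}  # activate the source
    supersteps = 0
    while inbox:  # halt when no more messages are sent
        next_inbox = defaultdict(list)
        for v, messages in inbox.items():  # only vertices with messages are active
            values[v], outgoing = sssp_vertex_program(values[v], messages, graph[v])
            for dst, value in outgoing:
                next_inbox[dst].append(value)
        inbox = dict(next_inbox)  # barrier: delivered in the next superstep
        supersteps += 1
    return values, supersteps


if __name__ == "__main__":
    # Hypothetical graph with 4 vertices and 6 directed, weighted edges;
    # it is not the graph drawn in Figure 3.
    g = {0: [(1, 2), (2, 4)], 1: [(2, 1), (3, 6)], 2: [(3, 3)], 3: [(0, 1)]}
    print(run_synchronous(g, 0))
```

In this sketch a vertex is "inactive" simply by being absent from the inbox, which mirrors the rule that vertices are inactive by default and are activated when a message is received.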

With this example providing insight into TLAV operation, particularly the synchronous message-passing
model of Pregel, the survey continues by more completely detailing TLAV properties and categorizing
different TLAV frameworks.


III. FOUR PILLARS OF TLAV FRAMEWORKS 

A TLAV framework is software that supports the iterative execution of user-defined vertex programs over
vertices of a graph. Frameworks are composed of several interdependent components that drive program
execution and ultimate system performance. These frameworks are not unlike an analytic operating system,
where component design decisions dictate how computations for a particular topology utilize the
underlying hardware.

Algorithm 1: Single Source Shortest Path for a Synchronized TLAV Framework

input: A graph G = (V, E) with vertices v ∈ V and edges from i → j s.t. e_ij ∈ E,
       and starting point vertex v_s ∈ V

foreach v ∈ V do shrtest_path_len_v ← ∞;    /* initialize each vertex data to ∞ */
send(0, v_s);                                /* to activate, send msg of 0 to starting point */
repeat                                       /* The outer loop is synchronized with BSP-styled barriers */
    for v ∈ V do in parallel                 /* vertices execute in parallel */
        /* vertices inactive by default; activated when msg received */
        /* compute minimum value received from incoming neighbors */
 1      minIncomingData ← min(receive(path_length));
        /* set current vertex-data to minimum value */
 2      if minIncomingData < shrtest_path_len_v then
 3          shrtest_path_len_v ← minIncomingData;
 4          foreach e_vj ∈ E do
                /* send shortest path + edge weight to outgoing edges */
 5              path_length ← shrtest_path_len_v + weight_e;
 6              send(path_length, j);
 7          end
 8      end
 9      halt();
    end
until no more messages are sent;

[Figure 3 panels: Superstep 0, message values = 2 and 4; Superstep 1, message values = 4, 3, and 8;
Superstep 2, message values = 6 and 7; Superstep 3, complete, no new messages.]

FIG. 3: Computing the Single Source Shortest Path in a graph. Dashed lines between supersteps represent
messages (with values listed to the right), and shaded vertices are inactive. Edge weights pictorially
included in first layer for Superstep 0, then subsequently omitted.

This section introduces the four principal pillars of TLAV frameworks. They are:

1. Timing - How user-defined vertex programs are scheduled for execution

2. Communication - How vertex program data is made accessible to other vertex programs

3. Execution Model - Implementation of vertex program execution and flow of data

4. Partitioning - How vertices of the graph, originally in storage, are divided up to be stored across
memory of the system’s multiple worker machines


The discussion proceeds as follows: the timing policy of vertex programs is presented in Subsection
III A, where system execution can be synchronous, asynchronous, or hybrid. Communication between vertex
programs is presented in Subsection III B, where intermediate data is shared primarily through
message-passing or shared-memory. The implementation of vertex program execution is presented in
Subsection III C, which overviews popular models of program execution and demonstrates how a particular
model implementation impacts execution and performance. Finally, partitioning of the graph from storage
into distributed memory is presented in Subsection III D.

Each pillar is heavily interdependent with other pillars, as each design decision is tightly integrated
and strongly influenced by other design decisions. While each pillar may be understood through a
sequential reading of the information provided, a more efficient, yet thorough understanding may be
achieved by freely forward- and cross-referencing other pillars, especially when related sections are
cited. The inter-relation of the four pillars is unavoidable and indivisible, not unlike a graph data
structure itself. The difficulty of independently describing each pillar certainly reflects the
challenge of processing a vertex in which a given result depends on the concurrent processing of
neighboring vertices. This survey is restricted to a sequential presentation of information in the form
of a paper. However, each pillar, though unique, depends on, and may only be described in relation to,
other pillars, so a sufficient understanding of any given pillar may only be achieved by understanding
all pillars of a TLAV framework, collectively. Thus one may begin to understand the challenges of
processing graphs (especially large graphs, when not all “pillars” are in the same “paper”) as in
Section I, Section II, and [78].


A. Timing 

In TLAV frameworks, the scheduling and timing of the execution is separate from the logic of the vertex
program. The timing of a framework characterizes how active vertices are ordered by the scheduler for
computation. Timing can be synchronous, asynchronous, or a hybrid of the two models. Frameworks that
represent the different fundamental timing models are presented in Table I.


1. Synchronous 

The synchronous timing model is based on the original bulk synchronous parallel (BSP) processing model
discussed above. In this model, active vertices are executed conceptually in parallel over one or more
iterations, called supersteps. Synchronization is achieved through a global synchronization barrier
situated between each superstep that blocks vertices from computing the next superstep until all
workers complete the current superstep. Each worker coordinates with the master to progress to the next
superstep. Synchronization is achieved because the barrier ensures that each vertex within a superstep
has access to only the data from the previous superstep. Within a single processing unit, vertices can
be scheduled in a fixed or random order because the execution order does not affect the state of the
program. The global synchronization barrier introduces several performance trade-offs.
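
The claim that execution order within a superstep does not affect program state follows from vertices reading only previous-superstep data. A minimal sketch of this property is given below, assuming a hypothetical vertex rule (each vertex adopts the minimum of its own and its in-neighbors' previous values) and a hypothetical function name `synchronous_step`.

```python
from typing import Dict, List


def synchronous_step(prev: Dict[int, int],
                     in_neighbors: Dict[int, List[int]],
                     order: List[int]) -> Dict[int, int]:
    """Compute one superstep; reads only `prev`, writes only the new state."""
    nxt: Dict[int, int] = {}
    for v in order:
        # Every read goes to the previous superstep's values, so a vertex never
        # observes an update made earlier in the same superstep.
        nxt[v] = min([prev[v]] + [prev[u] for u in in_neighbors[v]])
    return nxt


if __name__ == "__main__":
    # Hypothetical 4-vertex path graph 0 - 1 - 2 - 3.
    neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    state = {0: 5, 1: 7, 2: 1, 3: 9}
    forward = synchronous_step(state, neighbors, [0, 1, 2, 3])
    backward = synchronous_step(state, neighbors, [3, 2, 1, 0])
    print(forward == backward, forward)
```

Keeping two copies of the vertex state, as done here, is one simple way to obtain this isolation; it is shown only as an illustration and is not attributed to any particular framework.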


Synchronous systems are conceptually simple, demonstrate scalability, and perform exceptionally well
for certain classes of algorithms. While not all TLAV programs consistently converge to the same values
depending on system implementation, synchronous systems are almost always deterministic, making
synchronous applications easy to design, program, test, debug, and deploy. Although coordinating
synchronization imposes consistent overhead, the overhead becomes largely amortized for large graphs.
Synchronous systems demonstrate good scalability, with runtime often linearly increasing with the
number of vertices [80]. As will be discussed in Section III B 1, synchronous systems are often
implemented along with message-passing communication, which enables a more efficient “batch messaging”
method. Batch messaging can especially benefit systems with lots of network traffic induced by
algorithms with a low computation-to-communication ratio HUD.


Although synchronous systems are conceptually straightforward and scale well, the model is not without
drawbacks. One study found that synchronization, for an instance of finding the shortest path in a
highly-partitioned graph, accounted for over 80% of the total running time [23], so system throughput
must remain high to justify the cost of synchronization, since such coordination can be relatively
costly. However, when the number of active vertices drops or the workload amongst workers becomes
imbalanced, system resources can become under-utilized. Iterative algorithms often suffer from “the
curse of the last reducer,” otherwise known as the “straggler” problem, where many computations finish
quickly, but a small fraction of computations take a disproportionately longer amount of time El. For
synchronous systems, each superstep takes as long as the slowest vertex, so synchronous systems
generally favor lightweight computations with small variability in runtime.


Framework      Timing          Reference
Pregel         Synchronous     [80]
Giraph         Synchronous     m
Hama           Synchronous     [112]
GraphLab       Asynchronous    [74] [75]
PowerGraph     Both            ED
PowerSwitch    Hybrid          ED
GRACE          Hybrid          [135]
GraphHP        Hybrid          [23]
P++            Hybrid          (EG)

TABLE I: Execution timing model of selected frameworks.

Finally, synchronous algorithms may not converge in some instances. In graph coloring algorithms, for
example, vertices attempt to choose colors different than adjacent neighbors Il42l and require
coordination between neighboring vertices. However, during synchronous execution, the circumstance may
arise where two neighboring vertices continually flip between each other’s color. In general,
algorithms that require some type of neighbor coordination may not always converge with the synchronous
timing model without the use of some extra logic in the vertex program ED.
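
One way such an oscillation can arise is sketched below. The greedy rule used here (a conflicting vertex switches to the smallest color not used by its neighbors in the previous superstep) and the names `smallest_free_color` and `synchronous_coloring` are assumptions made for illustration, not the rule of any cited algorithm.

```python
from typing import Dict, List, Set


def smallest_free_color(neighbor_colors: Set[int]) -> int:
    """Return the smallest non-negative color not used by any neighbor."""
    color = 0
    while color in neighbor_colors:
        color += 1
    return color


def synchronous_coloring(adj: Dict[int, List[int]], colors: Dict[int, int],
                         max_supersteps: int) -> List[Dict[int, int]]:
    """Run the greedy rule synchronously; return the coloring after each superstep."""
    history = [dict(colors)]
    for _ in range(max_supersteps):
        prev = history[-1]
        nxt = {}
        for v in adj:
            # Each vertex sees only its neighbors' colors from the previous superstep.
            seen = {prev[u] for u in adj[v]}
            nxt[v] = smallest_free_color(seen) if prev[v] in seen else prev[v]
        history.append(nxt)
        if nxt == prev:
            break  # converged: no vertex changed color
    return history


if __name__ == "__main__":
    # Two adjacent vertices that start with the same color.
    for coloring in synchronous_coloring({0: [1], 1: [0]}, {0: 0, 1: 0}, 4):
        print(coloring)
```

Both vertices make the same decision from the same stale view in every superstep, so the conflict is never resolved. Breaking the symmetry, for example by letting only the vertex with the smaller identifier change color, is one instance of the "extra logic" mentioned above.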


2. Asynchronous 

In the asynchronous iteration model, no explicit synchronization points, i.e., barriers, are provided,
so any active vertex is eligible for computation whenever processor and network resources are
available. Vertex execution order can be dynamically generated and reorganized by the scheduler, and
the “straggler” problem is eliminated. As a result, many asynchronous models outperform corresponding
synchronous models, but at the expense of added complexity.

Theoretical and empirical research has demonstrated that asynchronous execution can generally
outperform synchronous execution US EH, albeit precise comparisons for TLAV frameworks depend on a
number of properties ED. Asynchronous systems especially outperform synchronous systems when the
workload is imbalanced. For example, when computation per vertex varies widely, synchronous systems
must wait for the slowest computation to complete, while asynchronous systems can continue execution,
maintaining high throughput. One disadvantage, however, is that asynchronous execution cannot take
advantage of batch messaging optimizations (see Section III B 4). Thus, synchronous execution generally
accommodates I/O-bound algorithms, while asynchronous execution well-serves CPU-bound algorithms by
adapting to large and variable workloads.

Many iterative algorithms exhibit asymmetric convergence. Low et al. demonstrated that, for PageRank,
the majority of vertices converged within one superstep, while only 3% of vertices required more than
10 supersteps El. Asynchronous systems can utilize prioritized computation via a dynamic schedule to
focus on more challenging computations early in execution to achieve better performance [74] [154].
Generally, asynchronous systems perform well by providing more execution flexibility, and by adapting
to dynamic or variant workloads.
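
A minimal sketch of barrier-free, prioritized execution is shown below for single-source shortest path. The priority used here (smallest tentative distance first, which makes the schedule resemble Dijkstra's algorithm) is an assumption chosen for illustration; it is not the scheduler of GraphLab or of any other cited system, and the function name `run_asynchronous` is hypothetical.

```python
import heapq
import math
from typing import Dict, List, Tuple

Edge = Tuple[int, float]  # (destination vertex, edge weight)


def run_asynchronous(graph: Dict[int, List[Edge]], source: int):
    """Process active vertices one at a time in priority order, with no barrier."""
    values = {v: math.inf for v in graph}
    values[source] = 0.0
    schedule = [(0.0, source)]  # dynamic schedule: (priority, vertex)
    vertex_updates = 0
    while schedule:
        dist, v = heapq.heappop(schedule)
        if dist > values[v]:
            continue  # stale entry: a better value was already applied
        vertex_updates += 1
        for dst, weight in graph[v]:
            candidate = dist + weight
            if candidate < values[dst]:
                # The update is visible immediately, not at the next superstep.
                values[dst] = candidate
                heapq.heappush(schedule, (candidate, dst))
    return values, vertex_updates


if __name__ == "__main__":
    # Hypothetical graph with 4 vertices and 6 directed, weighted edges.
    g = {0: [(1, 2), (2, 4)], 1: [(2, 1), (3, 6)], 2: [(3, 3)], 3: [(0, 1)]}
    print(run_asynchronous(g, 0))
```

The scheduler here is a single sequential priority queue; a distributed asynchronous system must additionally protect concurrent reads and writes of vertex values.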

Although intelligent scheduling can improve performance, schedules resulting in sub-optimal performance
are also possible. In some instances, a vertex may perform more updates than necessary to reach
convergence, resulting in excessive computation [152]. Moreover, if implementing the pull model of
execution, which is commonly implemented in asynchronous systems [74] and described in Section III C 2,
communication becomes redundant when neighboring vertex values don’t change ED nsa.

The flexibility provided by asynchronous execution comes at the expense of added complexity, not only
from scheduling logic, but also from maintaining data consistency. Asynchronous systems typically
implement shared memory, discussed in Section III B 2, where data race conditions can occur when
parallel computations simultaneously attempt to modify the same data. Additional mechanisms are
necessary to ensure mutual exclusion, which can challenge algorithm development because framework users
may have to consider low-level concurrency issues [135], as, for example, in GraphLab, where users must
select a consistency model ED.


3. Hybrid 

Rather than adhering to the inherent strengths and weaknesses of a strict execution model, several
frameworks work around a particular shortcoming through design improvements. One such implementation,
GraphHP, reduces the high fixed cost of the global synchronization barrier using pseudo-supersteps
[23]. Another implementation, GRACE, explores dynamic scheduling within a single superstep [135]. The
PowerSwitch system removes the need to choose between synchronous and asynchronous execution and
instead adaptively switches between the two modes to improve performance ED. Together, these three
frameworks illustrate how weaknesses with a particular execution model can be overcome through
engineering and problem solving, rather than strict adoption of an execution model.

As previously discussed, synchronous systems
suffer from the high, fixed cost of the global
synchronization barrier. The hybrid execution model
introduced by GraphHP, and also used by the P++
framework, reduces the number of supersteps by
decoupling intra-processor computation from the
inter-processor communication and synchronization [23].
To do this, GraphHP distinguishes between two types
of nodes: boundary nodes that share an edge across
partitions, and local nodes that only have
neighboring nodes within the local partition. During
synchronization, messages are only exchanged between
boundary nodes. As a result, in GraphHP, a given
superstep is composed of two phases: global and
local. The global phase, which is executed first, runs
the user program across all boundary vertices
using data transmitted from other boundary vertices as
well as from local vertices in the same partition.
Once the global phase is complete, the local phase
executes the vertex program on local vertices within
a pseudo-superstep. The pseudo-superstep differs
from a regular superstep in that: 1) pseudo-supersteps
have local barriers resulting in local iterations
independent of any global synchronization or
communication; and 2) local message passing is done
through direct, in-memory message passing, which
is much faster than standard MPI-style messages.
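
The boundary/local distinction can be sketched in a few
lines of Python. This is only an illustration of the
definition given above, not GraphHP's implementation;
the function name classify_vertices, the edge list, and
the partition map are assumptions made for the example.

```python
def classify_vertices(edges, partition):
    """Split vertices into boundary and local sets.

    edges: iterable of (u, v) pairs.
    partition: dict mapping vertex -> partition id.
    A vertex is a boundary vertex if any incident edge crosses partitions.
    """
    boundary = set()
    for u, v in edges:
        if partition[u] != partition[v]:
            boundary.update((u, v))
    local = set(partition) - boundary
    return boundary, local


# Hypothetical six-vertex path graph split across two partitions.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
partition = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
boundary, local = classify_vertices(edges, partition)
```

Under this sketch only the boundary set would take part
in the global phase; the local set could iterate within
pseudo-supersteps without any network traffic.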

A similar approach to segmented execution, as in
GraphHP and P++, is the KLA paradigm [52], which
creates a hybrid of synchronous and asynchronous
execution. For graphs, the depth of asynchronous
execution is parameterized, and asynchronous
execution is allowed for a certain number of levels
before a synchronous round. Similar to how GraphHP
implements a round of boundary vertex execution
before several rounds of local execution, KLA has
multiple traversals of asynchronous execution
before coordinating a round of synchronous execution.
The trade-off is between expensive global
synchronizations and cheap but possibly redundant
asynchronous computations. KLA is also similar to
delta-stepping used for single source shortest path
[82].

The single-machine framework GRACE explores
dynamic scheduling of vertices from within a single
synchronous round [135]. To do this, GRACE exposes
a programming interface that, from within a
given superstep, allows for prioritized execution of
vertices and selective receiving of messages outside
of the previous superstep. Results demonstrate
comparable runtime to asynchronous models, with better
scaling across multiple worker threads on a single
machine.

Knowing a priori which execution mode will
perform better for a given problem, algorithm, system,
or circumstance is challenging. Furthermore, the
underlying properties that give one execution model an
advantage over another may change over the course
of processing. For example, in the distributed
Single Source Shortest Path algorithm, the process
begins with few active vertices, where asynchronous
execution is advantageous, then propagates to a high
number of active vertices performing lightweight
computations, which is ideal for synchronous
execution, before finally converging amongst few
active vertices. For some algorithms, one execution
mode may outperform another only for certain
stages of processing, and the best mode at each stage
can be difficult to predict.

Motivated by the necessity for execution mode
dynamism, PowerSwitch was developed to adaptively
switch between synchronous and asynchronous
execution modes. Developed on
top of the PowerGraph platform, PowerSwitch can
quickly and efficiently switch between synchronous
and asynchronous execution. PowerSwitch
incorporates throughput heuristics with online sampling
to predict which execution mode will perform
better for the current period of computation.
Results demonstrate that PowerSwitch's heuristics
can accurately predict throughput, that the switching
between the two execution modes is well-timed, and
that overall runtime is improved for a variety of
algorithms and system configurations.


B. Communication 


Communication in TLAV frameworks concerns how
data is shared between vertex programs. The
two conventional models for communication in
distributed systems, as well as distributed algorithms,
are message passing and shared memory [146].
In message passing systems, data is exchanged
between processes through messages, whereas in
shared memory systems data for one process is
directly and immediately accessible by another
process. This section compares and contrasts message
passing and shared memory for TLAV frameworks.
A third method of communication, active messages,
is also presented. Finally, techniques to optimize
distributed message passing are discussed.


Diagrams in Figure 4 are referenced throughout
this section to illustrate the different communication
implementations. A sample graph is presented in
Figure 4a, and Figures 4b-4e depict four TLAV
communication implementations of the sample graph.
For each implementation, vertices are partitioned
across two machines, namely, vertices A, B, and C are
partitioned to machine p1, and vertices D, E, and
F are put on machine p2 (except Figures 4d and 4e,
where the graph is cut along vertex C). Solid arrows
represent local communication and dashed arrows
represent network traffic.


FIG. 4: Distributed communication patterns for common communication implementations. The sample
graph is partitioned across two machines (see Section III D), with vertices A, B, and C residing on machine
p1, and vertices D, E, and F on machine p2. Panel (a) is the sample graph. Pregel is represented in (b),
GraphLab in (c), PowerGraph in (d), and GRE (active messages with the Agent-Graph scatter vertex) in (e).


1. Message Passing 

In the message passing method of communication,
also known as the LOCAL model of distributed
computation [94], information is sent from one
vertex program kernel to another via a message. A
message contains local vertex data and is addressed
to the ID of the recipient vertex. In the archetypal
message-passing framework Pregel [80], a message
can be addressed anywhere, but because vertices do
not have ID information of all other vertices,
destination vertex IDs are typically obtained by
iterating over outgoing edges.

After computation is complete and a destination
ID for each message is determined, the vertex
dispatches messages to the local worker process. The
worker process determines whether the recipient
resides on the local machine or a remote machine. In
the case of the former, the worker process can place
the message directly into the vertex's incoming
message queue. Otherwise, the worker process looks up
the worker-id of the destination vertex and places
the message in an outgoing message buffer. The
outgoing message buffer in Pregel, a
synchronously-timed system, is flushed when it reaches a
certain capacity, sending messages over the network in
batches. Waiting until the end of a superstep to send
all outgoing remote messages can exceed memory
limits.
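
The routing decision described above can be sketched as
a toy worker in Python. The class name Worker, the
capacity parameter, and the batches_sent list that
stands in for the network are all assumptions made for
illustration; this is not Pregel's actual code.

```python
from collections import defaultdict


class Worker:
    """Toy worker: deliver locally, or buffer and batch for remote workers."""

    def __init__(self, worker_id, owner, capacity=2):
        self.worker_id = worker_id
        self.owner = owner                  # dict: vertex id -> worker id
        self.capacity = capacity            # flush threshold per remote worker
        self.inbox = defaultdict(list)      # incoming queues of local vertices
        self.outgoing = defaultdict(list)   # per-destination-worker buffers
        self.batches_sent = []              # stands in for the network

    def send(self, dest_vertex, payload):
        dest_worker = self.owner[dest_vertex]
        if dest_worker == self.worker_id:
            # Local recipient: place directly into its incoming queue.
            self.inbox[dest_vertex].append(payload)
            return
        buf = self.outgoing[dest_worker]
        buf.append((dest_vertex, payload))
        if len(buf) >= self.capacity:
            self.flush(dest_worker)

    def flush(self, dest_worker):
        # Send the whole buffer as one batch and start a fresh buffer.
        batch = self.outgoing.pop(dest_worker, [])
        if batch:
            self.batches_sent.append((dest_worker, batch))


owner = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2}
w1 = Worker(worker_id=1, owner=owner, capacity=2)
for dest in ("B", "D", "E", "F"):
    w1.send(dest, payload=0.25)
```

With a capacity of two, the messages for D and E leave
as one batch while the message for F waits in the
buffer, which is why a real system must also flush at
the end of a superstep.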

Message passing is commonly implemented with
synchronized execution, which guarantees data
consistency without low-level implementation details.
All messages sent during superstep S are received
in superstep S + 1, at which point a vertex
program can access the incoming message queue at
the beginning of S + 1's program execution.
Synchronous execution also facilitates batch messaging,
which improves network throughput. For I/O bound
algorithms with lightweight computation, such as
PageRank, where vertices are “always active”
so messaging is high [114], synchronous execution
has been shown to significantly outperform
asynchronous execution.


Message passing is depicted in Figure 4b, where
vertex C sends inter-machine messages to vertices
D, E, and F. Technically, messages are first
sent from C to the worker process of p1, which
routes the messages to the worker process of p2, which
places each message in the recipient vertex's incoming
message queue, but the worker process-related routing
is omitted from the figure without loss of
generality. Figure 4b represents a general message
passing framework, such as Pregel or Giraph. The three
messages sent by C across the network can
potentially be reduced using optimization techniques in
Section III B 4, namely Receiver-side Scatter, depicted
in Figure 5c.


2. Shared Memory


Shared memory exposes vertex data as shared
variables that can be directly read or modified
by other vertex programs. Shared memory avoids
the additional memory overhead constituted by
messages, and does not require intermediate
processing by workers. Shared memory is often
implemented by TLAV frameworks developed for a
single machine (see Section IV D), since challenges
to a shared memory implementation arise in the
distributed setting, where consistency
must be guaranteed for remotely-accessed vertices.
Inter-machine communication for distributed shared
memory still occurs through network messages. The
Trinity framework [113] implements a shared global
address space that abstracts away distributed
memory.


For shared memory TLAV frameworks, race
conditions may arise when an adjacent vertex resides on
a remote machine. Shared memory TLAV
frameworks often ensure memory consistency through
mutual exclusion by requiring serializable
schedules. Serializability, in this case, means that
every parallel execution has a corresponding
sequential execution that maintains consistency, cf. the
dining philosophers problem.

In GraphLab [74], border vertices are provided
locally-cached ghost copies of remote neighbors,
where consistency between ghosts and the
original vertex is maintained using pipelined distributed
locking [35]. In PowerGraph [43], the second
generation of GraphLab, graphs are partitioned by
edges and cut along vertices (see vertex-cuts in
Section III D), where consistency across cached
mirrors of the cut vertex is maintained using parallel
Chandy-Misra locking. GiraphX is a Giraph
derivative with a synchronous shared memory
implementation [125], which again provides
serialization through Chandy-Misra locking of border
vertices, although without local cached copies. The
reduced overhead of shared memory compared to
message passing is demonstrated by GiraphX, which
converges 35% faster than Giraph when computing
PageRank on a large Web graph [125]. Moreover,
some iterative algorithms perform better under
serialized conditions, such as Dynamic ALS [74, 158],
and popular Gibbs sampling algorithms actually
require serializability for correctness [42].

Shared memory implementations are depicted in
Figure 4c and Figure 4d. In Figure 4c, ghost
vertices, represented by dashed circles, are created for
every neighboring vertex residing on a remote
machine, as implemented by GraphLab [74]. One
disadvantage of shared-memory frameworks is seen
when computing on scale-free graphs, which have a
certain percentage of high degree vertices, such as
vertex C. In these cases the graph can be difficult to
partition, resulting in many ghost vertices.


Figure 4d depicts shared memory with vertex cuts
as implemented by PowerGraph [43]. PowerGraph
combines vertex-cuts (discussed in Section III D)
with the three-phase Gather-Apply-Scatter
computational model (see Section III C 1 c) to improve
processing of scale-free graphs. In Figure 4d, the graph
is cut along vertex C, where C1 is arbitrarily
chosen as the master and C2 as the mirror. For each
iteration, a distributed vertex performs computation
where: (i) both C1 and C2 compute a partial
result based on local neighbors, (ii) the partial result
is sent over the network from the mirror C2 to the
master C1, (iii) the master computes the final
result for the iteration, (iv) the master transmits the
result back to the mirror over the network, then (v)
the result is sent to local neighbors as necessary.
PowerGraph demonstrates how the combination of
advanced components, i.e., vertex-cuts and
three-phase computation, can overcome processing
challenges like imbalances arising from high-degree
vertices in scale-free graphs.
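
Steps (i) through (v) can be traced with a small Python
sketch. It assumes, purely for illustration, that the
vertex function is a sum over neighbor values; the
function name vertex_cut_update and the sample values
are hypothetical and do not come from PowerGraph.

```python
def vertex_cut_update(master_neighbors, mirror_neighbors, values):
    """One iteration for a vertex cut into a master and a single mirror.

    Returns the new vertex value and the number of network messages used.
    """
    network_messages = 0
    # (i) each replica computes a partial result from its local neighbors
    partial_master = sum(values[n] for n in master_neighbors)
    partial_mirror = sum(values[n] for n in mirror_neighbors)
    # (ii) the mirror sends its partial result to the master
    network_messages += 1
    # (iii) the master computes the final result for the iteration
    new_value = partial_master + partial_mirror
    # (iv) the master sends the result back to the mirror
    network_messages += 1
    # (v) each replica would now notify its local neighbors (omitted here)
    return new_value, network_messages


values = {"A": 1.0, "B": 2.0, "D": 3.0, "E": 4.0, "F": 5.0}
new_c, msgs = vertex_cut_update(["A", "B"], ["D", "E", "F"], values)
```

In this sketch the network cost is two messages per
iteration regardless of how many neighbors sit on the
mirror's machine, which is the property that helps with
high-degree vertices.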


Shared memory systems are often implemented
with asynchronous execution. Although consistency
is fundamentally maintained in synchronous
message passing frameworks like Pregel, asynchronous,
shared memory frameworks like GraphLab may
execute faster because of prioritized execution and
low communication overhead, but at the expense of
added complexity for scheduling and maintaining
consistency. The added complexity challenges
scalability, for as the number of machines and partitions
increases, more time and resources become devoted
to locking protocols.


Dynamic computation addresses asymmetric
convergence by only updating necessary vertices.
Shared memory with asynchronous execution is an
effective platform for dynamic computation,
because the movement of data is separated from
computation, allowing vertices to access neighboring
values even if the values have not changed between
iterations. This implies the pull mode of
information flow (see Section III C 2). In contrast, a
vertex in a message-passing framework would need all
neighboring values delivered in order to perform an
update, even if some values had not changed. Dynamic
computation is possible with message passing in the
Cyclops framework, which implements a distributed
immutable view. Cyclops is a synchronous shared
memory framework [24], where one of the
replicated vertices is designated the master, which
computes updates and messages the updated state to
replicas at the end of an iteration. Cyclops
outperforms synchronous message passing frameworks
by reducing the amount of processing performed by
each worker parsing messages, and is comparable to
PowerGraph while delivering significantly fewer
messages.

Significant deterioration in performance was
noted in [49, 77] for larger graphs, although
admittedly performance largely depends on algorithm
behavior. In short, asynchronous shared
memory systems can potentially outperform
synchronous message passing systems, though the latter
often demonstrate better scalability and
generalization.


3. Active Messages 

While message passing and shared memory are
the two most commonly implemented forms of
communication in distributed systems, a third method
called active messages is implemented in the GRE
framework [146]. Active messaging is a way
of bringing computation to data, where a message
contains both data as well as the operator to be
applied to the data [134]. Active messages are sent
asynchronously, and executed upon receipt by the
destination vertex. Within the GRE architecture,
active messages combine the process of sending and
receiving messages, removing the need to store
intermediate state, like message queues or edge data.
When combined with the framework's novel
Agent-Graph model, described below, GRE demonstrates
a 20%-55% reduction in runtime compared to
PowerGraph across three benchmark algorithms in real
and synthetic datasets, including a 39% reduction in
the execution time per iteration for PageRank on the
Twitter graph when scaled across 192 cores over 16
machines, compared to a PowerGraph
implementation on 512 cores across 64 machines [146].
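
The idea of a message that carries its own operator can
be illustrated with a short Python sketch. The
ActiveMessage dataclass and the deliver function are
hypothetical names introduced here; they are not part
of GRE's interface.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass
class ActiveMessage:
    dest: str                              # destination vertex id
    data: float                            # payload
    op: Callable[[float, float], float]    # operator applied at the destination


def deliver(state, msg):
    """Execute the message's operator on the destination's state upon receipt."""
    state[msg.dest] = msg.op(state[msg.dest], msg.data)


state = {"D": 0.0, "E": 10.0}
for msg in (
    ActiveMessage("D", 0.5, operator.add),
    ActiveMessage("D", 0.25, operator.add),
    ActiveMessage("E", 3.0, min),
):
    deliver(state, msg)
```

Because each message is applied as soon as it arrives,
the sketch never builds a per-vertex message queue,
which mirrors the reduced intermediate state described
above.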

The GRE framework modifies the data graph into
an Agent-Graph. The Agent-Graph is a model used
internally by the framework, but is not accessible to
the user. The Agent-Graph adds combiner and
scatter vertices to the original graph in order to reduce
inter-machine messaging. Figure 4e shows that an
extra scatter vertex, C', is added to create the
internal Agent-Graph model. The C' vertex acts as
a Receiver-side Scatter, depicted in Figure 5c. This
is useful because the new C' vertex allows C to
send only one message across the network, which
C' then disperses to vertices D, E, and F.
Combiner vertices are also added to the Agent-Graph in
the same way as Sender-side aggregation, depicted
in Figure 5a. The Agent-Graph employed by GRE
is similar to vertex-cuts in PowerGraph, except that
GRE messaging is unidirectional. Active messages
are also utilized for parallel graph computation
in the Active Pebbles framework [36, 140].


4. Message Passing Optimizations 

Message passing can be costly, especially over
a network. Thus several message-reducing
strategies have been developed in order to improve
performance. Some strategies are topology-driven and,
as such, exploit the graph layout across machines,
while other techniques are applied to specific
algorithmic behavior. Three topology-driven
optimizations are depicted in Figure 5 for messaging between
machines p1 and p2 (or messaging from p1, p2, and
p3 to another machine, for Figure 5b).

The Combiner, inspired by the MapReduce
function of the same name [32], is a message passing
optimization originally used by Pregel [80]. Presuming
the commutative and associative properties of a
vertex function, a Combiner executes on a worker
process and combines many messages destined for the
same vertex into a single message. For example, if
a vertex function computes the sum of all incoming
messages, then a Combiner would detect all
messages destined for a vertex v, compute the sum of
the messages, then send the new sum to v. A
Combiner can especially reduce network traffic when v
is remote, shown as sender-side aggregation
(Figure 5a). When v is local, a combiner can still reduce
memory overhead by aggregating messages before
placement into the incoming message queue, shown
as receiver-side aggregation (Figure 5b). For the
single-source shortest path algorithm, a combiner
implementation resulted in a four-fold reduction in
network traffic [80].
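
A combiner for the summation example can be sketched
as follows. The function name combine and the sample
messages are assumptions made for illustration; real
frameworks expose their own combiner interfaces.

```python
def combine(messages, combiner):
    """Merge messages per destination vertex.

    messages: list of (destination vertex, value) pairs.
    combiner: commutative, associative binary function.
    """
    combined = {}
    for dest, value in messages:
        if dest in combined:
            combined[dest] = combiner(combined[dest], value)
        else:
            combined[dest] = value
    return combined


# Five outgoing messages addressed to two remote vertices.
outgoing = [("D", 1.0), ("E", 2.0), ("D", 3.0), ("D", 4.0), ("E", 6.0)]
combined = combine(outgoing, lambda a, b: a + b)
```

Here five messages collapse into two, one per
destination. Because the combiner may see messages in
any order, the result is only well defined when the
function is commutative and associative.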

A related technique is the receiver-side scatter.
For instances where the same message is sent to
multiple vertices on the same remote machine,
network traffic can be reduced by sending only one
message and then having the destination worker
distribute multiple copies, depicted in Figure 5c.
The strategy has been employed in multiple
frameworks, including the Large Adjacency List
Partitioning in GPS, IBM's X-Pregel, as the
fetch-once behavior in LFGraph [54], and through scatter
nodes of the Agent-Graph in GRE [146]. The
technique reduces network traffic by increasing
memory and processing overhead, as worker nodes must
store the outgoing adjacency lists of other
workers. With this in mind, GPS maintains a threshold
where receiver-side scatter is only applied
for vertices above a certain degree. Experiments
showed that as the threshold is lowered, network
traffic at first decreases then plateaus, while runtime
decreases but then increases, demonstrating the
existence of an optimal vertex-degree threshold. In
X-Pregel, a ten-fold reduction in network traffic from
Receiver-side Scatter resulted in a 1.5 times speedup.
Clearly, the receiver-side scatter strategy can be
effective, but unlike the combiner is not guaranteed
to improve performance.


FIG. 5: Partition-driven optimization strategies for distributed message passing: (a) Sender-side Combiner,
(b) Receiver-side Combiner, (c) Receiver-side Scatter. The Combiner technique
employs both Sender-side and Receiver-side Combiners.
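
The saving from receiver-side scatter can be counted
with a small Python sketch. The function name
count_network_messages and the example placement
(vertex C on p1 with neighbors D, E, and F on p2) are
assumptions chosen to match the sample graph.

```python
def count_network_messages(out_neighbors, owner, src, use_scatter):
    """Count messages src sends over the network for one broadcast."""
    remote = [v for v in out_neighbors[src] if owner[v] != owner[src]]
    if use_scatter:
        # One message per remote machine; the receiving worker fans it out.
        return len({owner[v] for v in remote})
    # Otherwise one message per remote neighbor.
    return len(remote)


out_neighbors = {"C": ["A", "B", "D", "E", "F"]}
owner = {"A": "p1", "B": "p1", "C": "p1", "D": "p2", "E": "p2", "F": "p2"}
plain = count_network_messages(out_neighbors, owner, "C", use_scatter=False)
scattered = count_network_messages(out_neighbors, owner, "C", use_scatter=True)
```

The sketch only counts traffic. It does not model the
extra memory the receiving worker needs to hold the
sender's adjacency list, which is the cost that makes
the optimization a trade-off.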

The three partition-driven optimizations in
Figure 5 are related to the messaging structure of a
framework, and not specific to algorithm behavior,
albeit some assumptions are made regarding
message computation. Computation for the combiner
must be commutative and associative because
order cannot be guaranteed, while messages for the
receiver-side scatter must be identical, and
independent of the adjacency list. Still, the techniques are
oriented around partition-level messaging and apply
to the worker process, only requiring certain
operational properties in order to work. The
Message-Online-Computing model proposed in [157], which
improves memory usage by processing messages in
the queue as they are delivered, also requires
operations be commutative.

Conversely, algorithm-specific message
optimizations have also been developed that restructure
vertex messaging patterns for certain algorithmic
behaviors. For algorithms that combine
vertices into a supervertex, like Boruvka's
Minimum Spanning Tree [30], the Storing Edges at
Subvertices (SEAS) optimization implements a
subroutine where each vertex tracks its parent supervertex
instead of sending adjacency lists. For
algorithms where vertices remove edges, like in the
1/2-approximation for maximum weight matching
[99], the Edge Cleaning on Demand (ECOD)
optimization only deletes stale edges when,
counter-intuitively, activity is requested for the stale edge
[110]. To avoid slow convergence, ECOD is only
employed above a certain threshold, e.g., when more
than 1% of all vertices are active. Both SEAS and
ECOD exploit a trade-off between sending messages
proportional to the number of vertices or
proportional to the number of edges. Other strategies for
reducing communication, based on aggregate
computation, are discussed in Section V C.


C. Execution Model 

The model of execution for vertex-centric
programs describes the implementation of the vertex
function, and how data moves during computation.


1. Vertex Program Implementation 

Vertex functions have been implemented as one-,
two-, or three-phase models. Vertex functions have
also been implemented as edge-centric functions.
While the model choice does not typically impact the
accuracy of the final result, combining certain
implementations with other TLAV components can yield
improved system performance for certain graph
characteristics.

a. One Phase The vertex programming
abstraction implemented as a single function is
well-characterized by the Pregel framework [80]. The
single compute function of a vertex object
follows the general sequence of accessing input data,
computing a new vertex value, and distributing
the update. In a typical Pregel program, the
input data is accessed by iterating through the input
message queue (messages that may have utilized
a combiner), applying an update function based
on received data, and then sending the new value
through messages addressed by iterating over
outgoing edges. Details based on other design
decisions may vary, e.g., input and output data may be
distributed through incident edges, or neighboring
vertex data may be directly accessible, but in
one-phase models the general sequence of vertex
execution is performed within a single, programmed
function. The Vertex.Compute() function is
implemented in several TLAV frameworks in addition to
Pregel, including its open-source implementations
and several related variants [8, 106, 109].
The one-phase function implementation is
conceptually straightforward, but other frameworks
provide opportunities for improvement by dividing up
the computation.
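
The one-phase sequence (read messages, update, send)
can be sketched in Python with a toy synchronous loop.
The example propagates the maximum value through a
graph; the names compute and run, the halting rule, and
the three-vertex ring are assumptions for illustration
and do not reproduce Pregel's actual API.

```python
def compute(vertex, value, messages, out_edges, superstep):
    """Single vertex function: read messages, update value, send along out-edges.

    Returns (new_value, outgoing), where outgoing is a list of (dest, payload).
    """
    new_value = max([value] + messages)
    if superstep == 0 or new_value != value:
        return new_value, [(dest, new_value) for dest in out_edges[vertex]]
    return new_value, []  # nothing new to report, so stay silent


def run(values, out_edges):
    """Run supersteps until no vertex sends a message."""
    inbox = {v: [] for v in values}
    superstep = 0
    while True:
        next_inbox = {v: [] for v in values}
        sent = 0
        for v in values:
            values[v], outgoing = compute(v, values[v], inbox[v], out_edges, superstep)
            for dest, payload in outgoing:
                next_inbox[dest].append(payload)  # readable in superstep + 1
                sent += 1
        if sent == 0:
            return values, superstep
        inbox, superstep = next_inbox, superstep + 1


out_edges = {"A": ["B"], "B": ["C"], "C": ["A"]}
result, last_superstep = run({"A": 3, "B": 6, "C": 2}, out_edges)
```

Messages written during one pass of the loop are only
read in the next pass, which is the superstep barrier in
miniature.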


b. Two Phase A two-phase vertex-oriented
programming model breaks up vertex programming
into two functions, most commonly referred to as the
Scatter-Gather model. In Scatter-Gather, the
scatter phase distributes a vertex value to neighbors,
and the gather phase collects the inputs and applies
the vertex update. While most single-phase
frameworks, e.g., Pregel, can be converted into two phases,
the Scatter-Gather model was first explicitly put
forward in the Signal/Collect framework [123]. The
two-phase model is also presented as Scatter-Gather
in other work, and is presented as the Iterative
Vertex-Centric (IVEC) programming model in [147]. The
Scatter-Gather programming model commonly
occurs in TLAV systems where data is read from
and written to edges.


Ligra and Polymer are frameworks implemented
for single machines (see Section IV D) that both
implement a two-phase model. The user provides
two functions, one function that executes across
each vertex in the active subset and another
function that executes across all outgoing edges in the
subset. The frameworks adopt a vertex-subset-centric
programming model, which is similar to vertex-centric,
but the framework retains a centralized view of the
graph, where the whole graph is within the scope
of computation, which is possible because the
entire graph resides on a single machine in this case.
The two-phase model is executed within a program
processing the whole graph.


A related two-phase programming model for
message passing called Scatter-Combine is implemented
in the GRE framework [146]. This model utilizes
active messages, which are messages that include
both data as well as the operator to be executed
on the data [134]. In the first phase of the model,
messages are both sent (Scattered) and the
operators in the messages are executed (Combined) at the
destination vertex. In the second phase, the
combined result is used to update the vertex value. The
Scatter-Combine model incorporates two phases
differently than Scatter-Gather. Instead of the
two-phase Scatter-Gather model of (i) Gather-Apply, and
(ii) Scatter, the Scatter-Combine model uses active
messages to institute (i) Scatter-Gather, and then
(ii) Apply. The GRE framework combines
Scatter-Combine with a novel representation of the
underlying data graph, called the Agent-Graph, described
above, to reduce communication and improve
scalability for processing graphs with scale-free degree
distributions.

c. Three Phase A three-phase programming
model is introduced in PowerGraph as the
Gather-Apply-Scatter (GAS) model [43]. The Gather phase
performs a generic summation over all input vertices
and/or edges, like a commutative associative
combiner. The result is used in the Apply phase, which
updates the central vertex value. The Scatter phase
distributes the update by writing the value to the
output edges. PowerGraph incorporates the GAS model
with vertex-cut partitioning (see Section III D 3) to
improve processing of power-law graphs.
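
The three phases can be laid out in a short Python
sketch. The PageRank-style update with a damping
factor of 0.85 is an assumption chosen for illustration,
as are the function name gas_step and the dictionaries
used to hold graph state; none of this is PowerGraph's
interface.

```python
def gas_step(vertex, in_neighbors, out_neighbors, values, edge_data):
    """One Gather-Apply-Scatter update for a single vertex."""
    # Gather: commutative, associative sum over in-neighbors.
    acc = sum(values[u] / len(out_neighbors[u]) for u in in_neighbors[vertex])
    # Apply: update the central vertex value from the gathered result.
    values[vertex] = 0.15 + 0.85 * acc
    # Scatter: write the new value onto the outgoing edges.
    for w in out_neighbors[vertex]:
        edge_data[(vertex, w)] = values[vertex]
    return values[vertex]


in_neighbors = {"C": ["A", "B"]}
out_neighbors = {"A": ["C"], "B": ["C"], "C": ["D"]}
values = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
edge_data = {}
new_c = gas_step("C", in_neighbors, out_neighbors, values, edge_data)
```

Keeping Gather as a plain sum is what allows a framework
to split it across the replicas of a cut vertex and merge
the partial results afterwards.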

d. Edge-Centric The X-Stream framework
provides an edge-centric two-phase Scatter-Gather
programming model, as opposed to a
vertex-centric programming model. The model is
edge-centric because the framework iterates over
edges of the graph instead of vertices. However, the
framework may still be considered TLAV because
the two-phase program operates on source and target
vertices, adopting a similar local scope. X-Stream
leverages streaming edge data instead of random
access for efficient large-scale graph processing on
a single machine, and is discussed in Section IV D
in further detail.


2. Push vs. Pull 

The flow of information for vertex programs can
be characterized as data being pushed or pulled
[29, 51, 86]. In push mode, information flows from
the active vertex performing the update outward
to neighboring vertices, as in Pregel-like
message passing. In pull mode, information flows from
neighboring vertices inward to the active vertex, as
in GraphLab-like shared memory, when an active
vertex reads a neighbor's data. Few TLAV
frameworks explicitly adopt a push or pull mode. Instead,
the information flow arises from other design
decisions. Still, analyzing a system as push or pull
allows one to reason about other system properties.
For example, asynchronous execution is supported
by both modes, but sender-side combining is only
possible in push mode [29].
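
The two directions of information flow can be contrasted
with a minimal Python sketch that sums neighbor values
both ways. The function names push_round and pull_round
and the three-vertex graph are assumptions made for the
example.

```python
def push_round(values, out_neighbors):
    """Push: each vertex sends its value outward to its out-neighbors."""
    incoming = {v: 0.0 for v in values}
    for u in values:
        for w in out_neighbors[u]:
            incoming[w] += values[u]
    return incoming


def pull_round(values, in_neighbors):
    """Pull: each vertex reads values inward from its in-neighbors."""
    return {v: sum(values[u] for u in in_neighbors[v]) for v in values}


values = {"A": 1.0, "B": 2.0, "C": 4.0}
out_neighbors = {"A": ["C"], "B": ["C"], "C": []}
in_neighbors = {"A": [], "B": [], "C": ["A", "B"]}
pushed = push_round(values, out_neighbors)
pulled = pull_round(values, in_neighbors)
```

Both rounds compute the same sums; they differ in which
adjacency direction is traversed and in which vertex
initiates the data movement.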

Push and pull modes are more commonly
associated with databases and transactional
processing, though they have been more explicitly
incorporated in broader graph engines and temporal
frameworks (see Section VI for related work). The Galois
framework, with a flexible computation model enabling
the implementation of a vertex-centric interface,
allows users to choose push or pull mode [68, 86], as
does Kineograph [29]. Chronos experiments with
how push and pull modes impact caching.

Ligra is a single-machine graph processing
framework that dynamically switches between push and
pull-based operators based on a threshold. The
framework is in part inspired by a recently
developed shared-memory breadth-first search algorithm
that achieves remarkable performance by switching
between push and pull modes of exploration.
This algorithm, Ligra, and PowerSwitch from
Section III A 3 exemplify how performance can be
improved by dynamically adapting the processing
technique to properties of the graph.

The delta-caching optimization, introduced in
PowerGraph [43], reduces the
pulling of redundant data by tracking value changes.
In a three-phase model, an accumulator value is the
result of the gather step. With delta-caching, a cached
copy of the accumulator for each vertex is stored by
the worker, requiring additional storage. If, for a
given update, the change in the accumulator is
minimal, then neighboring vertices are not activated, and
any change can be applied to the cached copy stored
by the worker. A neighboring vertex can then use the
cached copy during an update. For delta-caching
to be available, the accumulator's sum operation must
be commutative, associative, and have an inverse.
Delta-caching reduces redundant pulling by not
activating neighboring vertices for small changes, and
resulted in a 45% decrease in runtime for computing
PageRank on the Twitter graph.
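
A cached accumulator that is patched with deltas can be
sketched as follows. The DeltaCache class, its tolerance
parameter, and the use of plain addition as the
accumulator operation are assumptions for illustration
and do not reflect PowerGraph's interface.

```python
class DeltaCache:
    """Cached gather accumulators that are patched with deltas."""

    def __init__(self, tolerance):
        self.tolerance = tolerance
        self.acc = {}         # vertex -> cached sum of in-neighbor values
        self.activated = []   # vertices scheduled for a fresh update

    def post_change(self, neighbor, old_value, new_value):
        """Called for each out-neighbor after a vertex changes its value."""
        # Subtraction is the inverse of the sum, which is why an inverse is needed.
        delta = new_value - old_value
        self.acc[neighbor] += delta  # patch the cache instead of re-gathering
        if abs(delta) > self.tolerance:
            self.activated.append(neighbor)


cache = DeltaCache(tolerance=0.01)
cache.acc["D"] = 3.0
cache.post_change("D", old_value=1.0, new_value=1.005)  # small change
cache.post_change("D", old_value=1.005, new_value=2.0)  # large change
```

The first change only adjusts the cached accumulator,
while the second also schedules D for an update, so D
never has to pull all of its neighbors again to obtain a
current sum.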


D. Partitioning 

Large-scale graphs must be divided into parts to
be placed in distributed memory. Good partitions
often lead to improved performance [109], but
expensive strategies can end up dominating
processing time, leading many implementations to
incorporate simple strategies, such as random placement
[56]. Effective partitioning evenly distributes the
vertices for a balanced workload, while minimizing
inter-partition edges to avoid costly network traffic,
a problem formally known as k-way graph
partitioning that is NP-complete with no fixed-factor
approximation.
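
The two competing objectives, few inter-partition edges
and evenly sized parts, can be measured with a small
Python sketch. The function name partition_quality and
the definition of imbalance as the largest part divided
by the ideal size are assumptions made for this example.

```python
from collections import Counter


def partition_quality(edges, partition, k):
    """Return (edge_cut, imbalance) for a k-way partition.

    edge_cut: number of edges whose endpoints lie in different parts.
    imbalance: largest part size divided by the ideal size n / k,
    so 1.0 means perfectly balanced.
    """
    edge_cut = sum(1 for u, v in edges if partition[u] != partition[v])
    sizes = Counter(partition.values())
    ideal = len(partition) / k
    imbalance = max(sizes.values()) / ideal
    return edge_cut, imbalance


# Star-like sample graph centered on C, split across two machines.
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("C", "F")]
partition = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
edge_cut, imbalance = partition_quality(edges, partition, k=2)
```

A partitioner would try to lower the first number while
keeping the second close to 1.0.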

Leading work in graph partitioning can be broadly
characterized as (1) rigorous but impractical
mathematical strategies, or (2) pragmatic heuristics used
in practice [128]. Practical strategies, such as
those employed in the suite of algorithms known as
METIS [60], often employ a three-phase multi-level
partitioning approach. Partition size is often
allowed to deviate in the form of a “slackness”
parameter in exchange for better cuts.

Graph partitioning with METIS partitioning
software is often considered the de facto standard
for near-optimal partitioning in TLAV frameworks
[122]. Despite a lengthy preprocessing time,
METIS algorithms significantly reduce total
communication and improve overall runtime for TLAV
processing on smaller graphs [109]. However, for
graphs of even medium size, the high
computational cost and necessary random access to the entire
graph render METIS and related heuristics
impractical. Alternatives for large-scale graph
partitioning include distributed heuristics presented in
Section III D 1, streaming algorithms in Section III D 2,
vertex cuts in Section III D 3, and dynamic
repartitioning in Section III D 4.


1. Distributed Heuristics 

Distributed heuristics are decentralized methods,
requiring little or no centralized coordination.
Distributed partitioning is related to distributed
community detection in networks, the two
main differences being: 1) communities can
overlap whereas partitions cannot, and 2) partitioning
requires a priori specification of the number of
partitions, whereas community detection typically does
not. Much distributed partitioning work has been
inspired by distributed community detection, namely
label propagation [102].

Label propagation occurs at the vertex level, where each vertex adopts the label of the
plurality of its neighbors. Though the process is decentralized, label propagation for
partitioning necessitates a varying amount of centralized coordination in order to maintain
balanced partitions and prevent “densification”: a cascading phenomenon where one label
becomes the overwhelming preference [102]. The densification problem is addressed in am,
wherein a simple capacity constraint is enforced that is equal to the available capacity of
the local worker divided by the number of non-local workers. In [129], balanced vertex
distribution is maintained by constraining label propagation and solving a linear programming
optimization problem that maximizes a relocation utility function. In nnu, vertices swap
labels, either with a neighbor or possibly a random node, and simulated annealing is employed
to escape local optima. The cost of centralized coordination incurred by these methods is
much less than the cost of random vertex access on a distributed architecture, as with
ParMETIS.
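
The constrained label propagation described above can be sketched as follows. This is a
minimal illustration and not the method of any cited work: the function name, the integer
labels, the fixed sweep order, and the single hard capacity per partition are assumptions
made for the example.

```python
from collections import Counter


def label_propagation_partition(adj, labels, capacity, sweeps=10):
    """Capacity-constrained label propagation (illustrative sketch).

    adj:      {vertex: [neighbor, ...]} for an undirected graph
    labels:   {vertex: int} initial partition id of every vertex
    capacity: maximum number of vertices allowed in one partition (assumed rule)
    """
    labels = dict(labels)
    sizes = Counter(labels.values())
    for _ in range(sweeps):
        moved = False
        for v in sorted(adj):
            counts = Counter(labels[u] for u in adj[v])
            if not counts:
                continue
            current = labels[v]
            # Plurality label among the neighbors; ties go to the smaller label id.
            best = max(counts, key=lambda p: (counts[p], -p))
            gains = counts[best] > counts.get(current, 0)
            # The capacity check keeps one label from absorbing every vertex.
            if best != current and gains and sizes[best] < capacity:
                sizes[current] -= 1
                sizes[best] += 1
                labels[v] = best
                moved = True
        if not moved:
            break
    return labels


# Two triangles joined by the edge (2, 3), starting from a poor assignment.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
start = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0}
print(label_propagation_partition(adj, start, capacity=4))
```

With a capacity of 4 the two triangles separate cleanly; with a capacity of 3 every
beneficial move is blocked because both partitions are already full, which illustrates why
the choice of constraint matters as much as the propagation rule itself.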

More advanced label propagation schemes for partitioning are presented in [136] and [120]. In
[136], label propagation is used as the coarsening phase of a multi-level partitioning scheme,
which processes the partitioning in blocks to accommodate multi-level partitioning for
large-scale graphs. In [120], several stages of label propagation are utilized to satisfy
multiple partitioning objectives under multiple constraints. 115011 use a parallel multi-level
partitioning algorithm for k-way balanced graphs that operates in two phases: an aggregate
phase that uses weighted label propagation, and then a partition phase that performs the
stepwise minimizing RatioCut method.


2. Streaming 

Streaming partitioning is a form of online processing that partitions a graph in a single
pass. For TLAV frameworks, streaming partitioning is especially efficient since the
partitioning can be performed by the graph loader, which loads the graph from disk onto the
cluster. The accepted streaming model assumes a single, centralized graph loader that reads
data serially from disk and chooses where to place the data amongst available workers
[122, 128]. Centralized streaming heuristics can be adapted to run in parallel [122]; however,
depending on the heuristic, concurrency between the parallel partitioners would likely be
required [89]. One of the first online heuristics was presented by Kernighan and Lin and is
used as a subroutine in METIS El. GraphBuilder [56] is a similar library that, in addition to
partitioning, supports an extensive variety of graph loading-related processing tasks. A
streaming partitioner on a graph loader reads data serially from disk, receiving one vertex at
a time along with its neighboring vertices. In a single look at the vertex, the streaming
partitioner must decide the final placement for the vertex on a worker partition, but the
streaming partitioner has access to the entire subgraph of already placed vertices. In a
variant of the streaming model, the partitioner has an available storage buffer with a
capacity equal to that of a worker partition, so the partitioner may temporarily store a
vertex and decide the partitioning later cm; however, this buffer is not utilized by the
top-performing streaming partitioners. For most heuristics, the placement of later vertices is
dependent on placement of earlier vertices, so the presentation order of vertices can impact
the partitioning. Thus, an adverse ordering can drastically subvert partitioning efforts;
however, experiments demonstrate that performance remains relatively consistent for
breadth-first, depth-first, and random orderings of a graph [122, 128].

Two top-performing streaming partitioning algorithms are greedy heuristics. The first is
linear deterministic greedy (LDG), a heuristic that assigns a vertex to the partition with
which it shares the most edges while weighted by a penalty function linearly associated with a
partition’s remaining capacity. The LDG heuristic is presented in [122], where 16 streaming
partitioning heuristics are evaluated across 21 different data sets. The use of a buffer in
addition to the LDG heuristic has been adapted for streaming partitioning of massive Resource
Description Framework (RDF) data [138]. Another variant uses unweighted deterministic greedy
instead of linear deterministic greedy (LDG) to perform greedy selection based on neighbors
without any penalty function; this unweighted variant has been employed for distributed matrix
factorization 0. Further analysis of LDG-related heuristics on random graphs, as well as lower
bound proofs for random and adversarial stream ordering, is presented in cm.
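
The LDG rule can be sketched in a few lines. The sketch below assumes the commonly stated
form of the score, the number of already placed neighbors in a partition multiplied by
(1 - size / capacity); the function name, the tie-breaking toward the emptier partition, and
the stream format of (vertex, neighbors) pairs are choices made for this illustration.

```python
def ldg_partition(stream, k, capacity):
    """Single-pass linear deterministic greedy placement (illustrative sketch).

    stream:   iterable of (vertex, [neighbor, ...]) in arrival order
    k:        number of partitions
    capacity: maximum number of vertices per partition
    """
    assignment = {}
    sizes = [0] * k
    for v, neighbors in stream:
        # Count the neighbors that were already placed in each partition.
        placed = [0] * k
        for u in neighbors:
            if u in assignment:
                placed[assignment[u]] += 1
        open_parts = [i for i in range(k) if sizes[i] < capacity]
        if not open_parts:
            raise ValueError("capacity * k is smaller than the number of vertices")
        # Neighbor count weighted by remaining capacity; ties go to the emptier partition.
        best = max(
            open_parts,
            key=lambda i: (placed[i] * (1.0 - sizes[i] / capacity), -sizes[i], -i),
        )
        assignment[v] = best  # the decision is final once made
        sizes[best] += 1
    return assignment


# Two triangles joined by the edge (2, 3), streamed in vertex-id order.
stream = [(0, [1, 2]), (1, [0, 2]), (2, [0, 1, 3]), (3, [2, 4, 5]), (4, [3, 5]), (5, [3, 4])]
print(ldg_partition(stream, k=2, capacity=3))
```

Because each decision only consults the placements made so far, the result depends on the
stream order, which is the sensitivity discussed in the text.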

Another top-performing streaming partitioner is FENNEL [128], which is inspired by a
generalization of optimal quasi-cliques Gm. FENNEL achieves high quality partitions that are
in some instances comparable with near-optimal METIS partitions. Both FENNEL and LDG have been
adapted to the restreaming graph partitioning model, where a streaming partitioner is provided
access to previous stream results [89]. Restreaming graph partitioning is motivated by
environments such as online services where the same, or slightly modified, graph is repeatedly
streamed with regularity. Despite adhering to the same linear memory bounds as a single-pass
partitioning, the presented restreaming algorithms not only provide results comparable to
METIS, but are also capable of partitioning in the presence of multiple constraints and in
parallel without inter-stream communication.


3. Vertex Cuts 

A vertex-cut, depicted in Figure 4d, is equivalent to partitioning a graph by edges instead of
vertices. Partitioning by edges results in each edge being assigned to one machine, while
vertices are capable of spanning multiple machines. Only changes to values of cut vertices are
passed over the network, not changes to edges. Vertex-cuts are implemented by TLAV frameworks
in response to the challenges of finding well-balanced edge cuts in power-law graphs ID ED.
Complex network theory suggests power-law graphs have good vertex cuts in the form of nodes
with high degree 0. A rigorous review of vertex separators is presented in [39].

PowerGraph combines vertex-cuts with the three-phase GAS model (Section III C 1 c) for
efficient communication and balanced computation [43]. For vertices that are cut and span
multiple machines, one copy is randomly designated the master, and remaining copies are
mirrors. During an update, all vertices first execute a gather, where all incoming edge values
are combined with a commutative associative sum operation. Then the mirrors transmit the sum
value over the network to the master, which executes the apply function to produce the updated
vertex value. The master then sends the result back over the network to the mirrors. Finally,
each vertex completes the update by scattering the result along its outgoing edges. For each
update, network traffic is proportional to the number of mirrors; therefore, breaking up
high-degree vertices reduces network communication and helps to balance computation.
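
The master/mirror exchange can be illustrated with a small single-process simulation. This is
a sketch under stated assumptions, not PowerGraph's API: the function name and data structures
are hypothetical, the master is chosen as the lowest machine id rather than at random so that
the example is deterministic, the gathered edge value is simply the source vertex value, and
the scatter phase is omitted.

```python
from collections import defaultdict


def gas_step(edge_placement, values, apply_fn):
    """One synchronous gather/apply step over a vertex-cut (illustrative sketch).

    edge_placement: {machine: [(src, dst), ...]}, each edge lives on exactly one machine
    values:         {vertex: value}
    apply_fn:       function(old_value, gathered_sum) -> new_value
    Returns (new_values, network_messages).
    """
    # A vertex has a replica on every machine that holds one of its edges.
    replicas = defaultdict(set)
    for machine, edges in edge_placement.items():
        for src, dst in edges:
            replicas[src].add(machine)
            replicas[dst].add(machine)

    # Gather: every replica sums the in-edge contributions it can see locally.
    partial = defaultdict(float)
    for machine, edges in edge_placement.items():
        for src, dst in edges:
            partial[(dst, machine)] += values[src]

    new_values = dict(values)
    messages = 0
    for v, machines in replicas.items():
        master = min(machines)  # assumption: deterministic stand-in for a random choice
        mirrors = machines - {master}
        total = sum(partial.get((v, m), 0.0) for m in machines)
        # Each mirror sends one partial sum and receives one updated value.
        messages += 2 * len(mirrors)
        new_values[v] = apply_fn(values[v], total)
    return new_values, messages


placement = {0: [(1, 0), (2, 0)], 1: [(3, 0), (0, 4)]}
values = {v: 1.0 for v in range(5)}
print(gas_step(placement, values, lambda old, total: total))
```

In this example only vertex 0 is cut, so the step costs two network messages regardless of
how many edges vertex 0 has on each machine.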

Since its initial implementation in PowerGraph, the vertex-cut approach has been adopted by
several other TLAV frameworks. GraphX is a vertex programming abstraction for the Spark
processing framework HU [m, where the adoption of vertex-cuts demonstrated an 8-fold decrease
in the platform’s communication cost. GraphBuilder [56], an open-source graph loader, supports
vertex-cuts and implements grid and torus-based vertex-cut strategies that were later included
in PowerGraph. PowerLyra [25] is a modification to PowerGraph that hybridizes partitioning:
vertices with a degree above a user-defined threshold are cut, while vertices below the
threshold are partitioned using an adaptation of the FENNEL streaming algorithm Hm. PowerLyra
also incorporates unidirectional locality similar to the GRE framework (see Section III B 3).
BiGraph is a framework developed on PowerGraph that implements partitioning algorithms for
large-scale bipartite graphs [26]. LightGraph [155] is a framework that optimizes vertex-cut
partitions by using edge-direction-aware partitioning, and by not sending updates to mirrors
with only in-edges.

Several edge partitioning analyses and algorithms have recently been developed. A thorough
analysis comparing expected costs of vertex partitioning and edge partitioning is presented in
[15]. In this study, edge partitioning is empirically demonstrated to outperform vertex
partitioning, and a streaming least marginal cost greedy heuristic is introduced that
outperforms the greedy heuristic from PowerGraph.

Centralized hypergraph partitioning, including edge partitioning, is NP-hard, and several
exact algorithms have been developed [33, 48, 66, 113]. However, because of their complexity,
such algorithms are too computationally expensive and not practical for large-scale graphs.
Centralized heuristics have been shown to be equally impractical E2. A large-scale vertex-cut
approach for bipartite graphs based on hypergraph partitioning is presented in [84] as part of
a vertex-centric program for computing the alternating direction method of multipliers
optimization technique. A distributed edge partitioner was developed in [103] that creates
balanced partitions while reducing the vertex cut, based on the vertex partitioner in [104].
Good workload balance for skewed degree distributions can also be achieved with degree-based
hashing [142]. Finally, as part of a non-vertex-centric BSP graph processing framework, a
distributed vertex-cut partitioner is presented in [46] that uses a market-based model where
partitions use allocated funds to buy an edge.


4. Dynamic Repartitioning 

While an effective partitioning equally distributes vertices among the partitions, for TLAV
frameworks, the number of active vertices performing updates on a given superstep can vary
drastically over the course of computation, which creates processing imbalances and increases
run time. Dynamic repartitioning was developed to maintain balance during processing by
migrating vertices between workers as necessary.

Reasons for changing active vertex sets include topological mutations to the graph and
algorithmic execution properties. Topological mutations may occur if the framework supports
dynamic or temporal graphs (see Related Work in Section VI). Topology may also change due to
the algorithm, such as graph coarsening [136].

With a static topology, the execution pattern of the algorithm can also change the active
vertex set. While vertex algorithms such as synchronous PageRank execute on every vertex for
every superstep, other algorithms introduce dynamism. 1333 classifies 9 vertex algorithms as
either (i) always active, (ii) traversal, or (iii) multi-phase, where the active vertex set of
the latter two classifications can vary widely and unpredictably, depending on the graph. For
dynamic repartitioning to prove beneficial, the associated overhead must be less than the
additional costs stemming from processing imbalance.

According to [109], a dynamic repartitioning strategy must directly address (i) how to select
vertices to reassign, (ii) how and when to move the assigned vertices, and (iii) how to locate
the reassigned vertices. Other properties of a strategy include whether coordination is
centralized or decentralized, and how the strategy combats “densification” and enforces vertex
balance. Densification is akin to the rich-get-richer phenomenon, and can occur in greedy or
decentralized protocols for partitioning/clustering, where one partition becomes
over-populated as the repeated destination for migrated vertices [131]. In response, protocols
often implement constraints that prevent a partition from exceeding a certain capacity. The
XPregel framework, for example, only permits the worker with the most vertices and edges to
migrate vertices Si-

Table II presents 6 TLAV frameworks that support dynamic repartitioning: GPS [109], Mizan
[64], XPregel ®, xDGP ifTUI, LogGP fEM, and the Catch the Wind prototype [114]. The table
includes what active vertex set imbalances are targeted by the frameworks, what metrics are
used to identify vertices for reassignment, how reassigned vertices are located after
migration, how densification is avoided, and whether the protocol is centralized or
decentralized.

Among the 6 frameworks that implement dynamic repartitioning, all are synchronous, and
repartitioning occurs at the end of a superstep, separate from the updates. When a vertex is
selected for migration, the worker must send all associated data to the new worker, including
the vertex ID, the adjacency list, and the incoming messages to be processed in the next
superstep. To avoid sending all incoming messages over the network, many dynamic
repartitioning frameworks implement a form of delayed migration, where the new worker is
recognized as the owner of the migrated vertex, but the vertex value remains on the old worker
for an extra iteration in order to compute an update. With delayed migration, the incoming
message queue does not need to be migrated, but the new worker still receives new incoming
messages [64, 109].

Though fundamentally sound, many experiments demonstrate that dynamic repartitioning is often
not worth the high overhead. Results in [8] show that while network I/O is significantly
reduced over time, overall runtime shows minor improvements. Independent tests of GPS show
dynamic repartitioning to be detrimental for all cases in ED, and similar results are observed
for GPS and Mizan in [49]. However, one major shortcoming in these evaluations is the use of
the PageRank algorithm for experimentation. Dynamic repartitioning is most effective for
dynamic active vertex sets, but with PageRank vertices are always active, so dynamic
repartitioning performs predictably poorly. Asynchronous dynamic repartitioning protocols have
yet to be explored for TLAV frameworks, but the added complexity and overhead for asynchrony
demonstrated in Section III A suggest that such an implementation is not practical.


IV. IMPLEMENTATION 


This section overviews implementation details of TLAV frameworks relating to the distributed
environment. These details include system architecture and fault tolerance. Additionally, TLAV
frameworks that employ novel techniques to process large-scale graphs on single machines are
surveyed.


A. System Architecture 


TLAV frameworks generally employ the master-slave architecture. A master node initializes the
slave workers, monitors execution, and manages coordination (and synchronization if invoked)
amongst the workers. Generally, the master is responsible for graph loading and partitioning,
but with a network filesystem available, the loading and partitioning can be performed in
parallel [109]. The master also stores global values, such as aggregators [80]. The workers
each execute a copy of the program on the local partitions and inform the master of runtime
status.
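
The division of labor between master and workers can be sketched with a single-process
simulation of synchronous supersteps. The sketch is not taken from any specific framework: the
function names are hypothetical, the vertex program is the familiar maximum-value propagation
example, and the "network" is an in-memory dictionary of inboxes.

```python
def worker_superstep(part, adj, values, inbox, superstep):
    """Run the vertex program on one worker's partition; return outgoing messages."""
    out = []
    for v in part:
        msgs = inbox.get(v, [])
        improved = bool(msgs) and max(msgs) > values[v]
        if improved:
            values[v] = max(msgs)
        # Every vertex announces its value once, then only speaks when it changes.
        if superstep == 0 or improved:
            out.extend((u, values[v]) for u in adj[v])
    return out


def master_run(partitions, adj, values, max_supersteps=50):
    """Master loop: start workers, wait at the barrier, route messages, detect halting."""
    values = dict(values)
    inbox, superstep = {}, 0
    while superstep < max_supersteps:
        # Each worker executes the same program on its local partition.
        outboxes = [worker_superstep(p, adj, values, inbox, superstep) for p in partitions]
        # Barrier: messages are delivered only after every worker has reported.
        inbox = {}
        for out in outboxes:
            for dst, msg in out:
                inbox.setdefault(dst, []).append(msg)
        superstep += 1
        if not inbox:  # no worker sent anything, so the computation halts
            break
    return values, superstep


adj = {0: [1], 1: [0, 2], 2: [1]}
print(master_run([[0, 1], [2]], adj, {0: 3, 1: 1, 2: 2}))
```

In a real deployment each call to `worker_superstep` would run on a separate machine and the
master would only see status reports and aggregated values, not the vertex data itself.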

One notable exception to the general master-slave architecture is XPregel 0, implemented in
X10 [22]. X10 implements an Asynchronous Partitioned Global Address Space (APGAS), which is a
shared address space but with a local structure that enables highly productive distributed and
parallel programming. With APGAS, the number of local “places” is provided at runtime, which
the programmer may utilize as necessary. XPregel does implement master-slave, but in X10, the
master is actually just place 0, sans hierarchy, which opens the door for alternative
architectures, like recursive structures.


B. Multi-Core Support 


For multi-core machines, many BSP-based frameworks including Pregel [80] simply assign a
partition to a given core, but frameworks can better utilize computational resources through
multi-threading. XPregel 0 supports multi-threading by dividing a partition into a
user-defined number of subpartitions, assigning one thread to each subpartition. GraphLab [74]
implements multi-threading and avoids deadlocks through scheduler restrictions. GPS [109]
implements 3 types of threads: a thread for vertex computation, a thread for communication,
and a thread for parsing. Cyclops [24] implements a hierarchical BSP model cm with a split
design to parallelize computation and messaging while exploiting locality and avoiding
synchronization contention. Cyclops demonstrates that multi-threading can improve runtime
relative to single-threaded execution for the same framework, at the expense of added
complexity.

Framework       Cause of Reassignment  Imbalance Metric             How to Locate Migrated Verts  Densification Avoidance              Coordination
GPS             Algorithm              Sent Msgs                    Broadcast Vert ID             Swap Min-Set                         Decentralized
Mizan           Algorithm              Sent/Recv Msgs and Run Time  Distributed Hash Table        Metric-based Swap                    Decentralized
XPregel         Algorithm              Sent/Recv Msgs               Broadcast Worker ID           Repartition Largest Worker           Centralized
xDGP            Topology               Labels of Neighbors          Broadcast Worker ID           Fraction of Capacity                 Decentralized
LogGP           Both                   Runtime                      Lookup Table                  Repartition Longest-Running Workers  Centralized
Catch the Wind  Algorithm              Sent/Recv Msgs               Lookup Table                  Quota                                Decentralized

TABLE II: Feature summary for TLAV frameworks that implement dynamic repartitioning. Active
vertex set imbalance may arise from topology changes, algorithm execution, or both. A
repartitioning strategy includes how to select vertices for reassignment, and how reassigned
vertices are later located. Strategies should also avoid densification, and can be centrally
or decentrally implemented. All implemented frameworks are synchronous.


C. Fault Tolerance 

Distributed systems must often account for the potential failure of one or more nodes over the
course of computation. When a node fails, a replacement node may become available, but all
data and computation performed on the failed node is lost.

Checkpointing is a common fault tolerance implementation, where an immutable copy of the data
is written to persistent storage, such as a network filesystem. Pregel implements synchronous
checkpointing, where the graph is copied in between supersteps [80]. When a failure occurs,
the system rolls back to the most recently saved point, all partitions are reloaded, and the
entire system resumes processing from the checkpoint. The partition of the failed node is
reloaded to a new replacement node. If messaging information is also logged, then resources
can be saved by only reloading and recomputing data on the replacement node. GraphLab EH
implements asynchronous vertex checkpointing, based on Chandy-Lamport [20] snapshots, which
need not halt the entire program and can result in slightly faster overall execution than
synchronous checkpointing, subject to certain program constraints.
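
The cost of synchronous checkpointing with full rollback can be made concrete with a small
simulation. This is a sketch under stated assumptions and not Pregel's implementation: the
function and parameter names are hypothetical, the "persistent storage" is a pickle file in a
temporary directory, and a failure is simulated by discarding the in-memory state at a chosen
superstep.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(state, step_fn, total_supersteps, every, fail_at=None):
    """Run supersteps with periodic checkpoints and one optional simulated failure.

    Returns (final_state, supersteps_executed), where the second value includes the
    supersteps that had to be recomputed after the rollback.
    """
    with tempfile.TemporaryDirectory() as ckpt_dir:
        path = os.path.join(ckpt_dir, "checkpoint.pkl")
        superstep, executed = 0, 0
        while superstep < total_supersteps:
            if superstep % every == 0:
                # Synchronous checkpoint: taken between supersteps, at the barrier.
                with open(path, "wb") as f:
                    pickle.dump((superstep, state), f)
            if superstep == fail_at:
                fail_at = None  # fail only once
                # Roll back: every partition is reloaded from the last checkpoint.
                with open(path, "rb") as f:
                    superstep, state = pickle.load(f)
                continue
            state = step_fn(state)
            superstep += 1
            executed += 1
        return state, executed


increment = lambda s: {k: v + 1 for k, v in s.items()}
print(run_with_checkpoints({"a": 0}, increment, total_supersteps=10, every=4, fail_at=6))
```

With a checkpoint every 4 supersteps and a failure at superstep 6, the two supersteps after
the last checkpoint are executed twice, so 12 supersteps run in total instead of 10. A shorter
checkpoint interval reduces recomputation at the price of more writes to storage.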

GraphX is a graph processing library for Apache Spark, which is developed based on the
Resilient Distributed Dataset (RDD) abstraction [44]. RDDs are immutable, partitioned
collections created through data-parallel operators, like map or reduce. RDDs are either
stored externally, or generated in-memory from operations on other RDDs. Spark maintains the
lineage of operations on an RDD, so upon any node failure the RDD can be automatically
recovered. GraphX leverages the RDDs of Spark to create a graph abstraction and Pregel
interface.

The Imitator [137] framework implements fault tolerance based on vertex replicas, or
ghosts/mirrors used in shared memory (see Section III B 2). The use of replicas for fault
tolerance is founded in the observation that the hash partitioning of many real-world directed
graphs results in the replication of over 99% of vertices US). By replicating every vertex, a
full copy of the graph can reside in distributed memory, enabling faster recovery times at the
expense of relatively little additional memory consumption and network messaging [137]. The
efficiency of Imitator is tied to the effectiveness of the partitioning (see Section III D).
Imitator outperforms checkpointing for large graphs distributed over several nodes, when only
one replica per vertex is required. State-of-the-art partitioning methods like METIS, or a
smaller number of partitions (Imitator experiments were run on 50 nodes), would likely lead to
increased overhead for Imitator. Also, the number of replicas is tied to the degree of fault
tolerance. To support the failure of k machines, k replicas are required, increasing overhead
for each additional failure supported.

A partition-based checkpoint method for fault tolerance is presented in cm. During execution,
a recovery executor node collects run-time statistics, and upon failure, uses heuristics to
redistribute the partitions. Checkpointed partitions of the failed nodes can be reassigned
amongst both new and old nodes, parallelizing recovery. Partitions on healthy nodes can also
be reassigned for load balancing.


D. Single Machine Architectures 

Like MapReduce, TLAV frameworks are advantageous because they are highly scalable while
providing a simple programming interface, abstracting away the lower level details of
distributed computing. However, such environments also stipulate the availability of elaborate
infrastructure, cluster management, and performance tuning, which may not be available to all
users.

Single machine systems are easier to manage and program, but commodity machines do not have
the memory capacity to process large-scale graphs in-memory. This section overviews single
machine TLAV frameworks that employ novel methods to process large-scale graphs. The main
features of the 4 single machine frameworks in this section are presented in Table III.

Processing large-scale graphs on a single machine requires either substantial amounts of
memory, or storing part of the graph out-of-memory, in which case performance is dictated by
how efficiently the graph can be fetched from storage. In 033, it is argued that high-end
servers, offering 100 GB to 1 TB of memory or more, have enough capacity for many real and
synthetic graphs reported in the literature. Such machines would be capable of storing large
graphs and executing relatively simple graph algorithms, though more complex algorithms would
likely exhaust resources.

The recommendation service at Twitter El, which implements a single machine graph processing
system with 144 GB of RAM, finds that in practice one edge occupies roughly five bytes of RAM
on average. Compression techniques are further explored for large memory servers in 111 1811.
Yet, graphs of scale are not practical on lower-end machines containing around 8 to 16 GB of
memory [69]. Accordingly, single machine frameworks have been developed that implement the
vertex-centric programming model and process a graph in parts. Central to many single machine
TLAV frameworks are novel data layouts that efficiently read and write graph data to/from
external storage. One common representation is the compressed sparse row format, which
organizes graph data as out-going edge adjacency sets, allowing for the fast look-up of
outgoing edges, and has been implemented in many state-of-the-art shared memory graph
processors [53, 93], including Galois [86].
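
The compressed sparse row layout mentioned above can be sketched as two flat arrays. The
function names below are chosen for this illustration and do not come from any of the cited
systems; vertex ids are assumed to be the integers 0 to n-1.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays from a list of directed (src, dst) edges.

    offsets[v] .. offsets[v + 1] delimits the out-neighbors of v inside targets.
    """
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1          # count the out-degree of each vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]   # prefix sum turns counts into start positions
    targets = [0] * len(edges)
    cursor = offsets[:-1]              # slice copy: next free slot for each vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def out_neighbors(offsets, targets, v):
    """Out-neighbors of v are one contiguous slice, so look-up needs no search."""
    return targets[offsets[v]:offsets[v + 1]]


offsets, targets = build_csr(3, [(0, 1), (0, 2), (2, 0), (1, 2)])
print(offsets, targets, out_neighbors(offsets, targets, 0))
```

Because the neighbors of a vertex are stored contiguously, reading an adjacency set is a
sequential access, which is the property that makes the layout attractive for both in-memory
and storage-backed processing.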

GraphChi. The seminal single machine TLAV framework is GraphChi [69], which was explicitly
developed for large-scale graph processing on a commodity desktop. GraphChi enables
large-scale graph processing by implementing the Parallel Sliding Window (PSW) method, a graph
data layout previously utilized for efficient PageRank and sparse-matrix dense-vector
multiplication HUlIl. PSW partitions vertices into disjoint intervals, associating with each
interval a shard containing all of the interval’s incoming edges, sorted by source vertex.
Intervals are selected to form balanced shards, and the number of intervals is chosen so any
interval can fit completely in memory. A sliding window is maintained over every interval, so
when vertices from one shard are updated from in-edges, the results can be sequentially
written to out-edges found in sorted order in the window on other shards. GraphChi may not be
faster than most distributed frameworks, but often reaches convergence within an order of
magnitude of the performance of distributed frameworks [69], which is reasonable for a desktop
with an order of magnitude less RAM. The GraphChi framework was later extended to a general
graph management system for a single machine called GraphChi-DB ED.
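
The shard layout behind PSW can be illustrated in memory. The sketch below is not GraphChi's
code: the function names are hypothetical, intervals are given as inclusive id ranges, and the
shards are Python lists rather than files. It shows why sorting each shard by source vertex
turns the out-edges of an interval into one contiguous block (the window) in every other
shard.

```python
from bisect import bisect_left, bisect_right


def build_shards(edges, intervals):
    """Shard i holds every edge whose destination lies in interval i, sorted by source."""
    shards = [[] for _ in intervals]
    for src, dst in edges:
        for i, (lo, hi) in enumerate(intervals):
            if lo <= dst <= hi:
                shards[i].append((src, dst))
                break
    for shard in shards:
        shard.sort()  # tuples sort by source first, then destination
    return shards


def sliding_window(shard, interval):
    """Contiguous block of a shard whose sources fall inside the given interval."""
    lo, hi = interval
    sources = [src for src, _ in shard]
    return shard[bisect_left(sources, lo):bisect_right(sources, hi)]


intervals = [(0, 2), (3, 5)]
edges = [(0, 3), (1, 2), (4, 1), (2, 5), (5, 0), (3, 4)]
shards = build_shards(edges, intervals)
# In-edges of interval 0 are the whole of shard 0; its out-edges into interval 1
# are a single window of shard 1.
print(shards[0], sliding_window(shards[1], intervals[0]))
```

Processing one interval therefore needs one full shard read plus one sequential window read
per remaining shard, instead of a random access per edge.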

Storage concepts for single machine graph processing are further explored in [147] through two
directions. The first project investigates reducing random accesses in SSDs through
prefetching, in a project called RASP that later evolved into PrefEdge [87]. The second
project is X-Stream DEI, an edge-centric single machine graph processing framework that
exploits the trade-off between random memory access and sequential access from streaming data.

X-Stream. Streaming data from any storage medium provides much greater bandwidth than random
access. Experiments on the X-Stream testbed, for example, demonstrate that streaming data from
disk is 500 times faster than random access fTOTi. X-Stream combines a novel data layout,
where an index is built over a storage-based edge list, with an edge-centric Scatter-Gather
programming model that includes a shuffle phase. Data is read from, and updates are written
to, streaming edge data. Though the framework is edge-centric, a user-defined update function
is executed on the destination vertex of an edge. X-Stream reports that it can process a
64-billion edge graph on a single machine with a pair of 3 TB magnetic disks attached [81].
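
The edge-centric scatter, shuffle, and gather phases can be sketched with an in-memory edge
list. This is an illustration under stated assumptions and not X-Stream's implementation: the
function name is hypothetical, the example algorithm is minimum-label connected components,
partitions are assigned by vertex id modulo the partition count, and an undirected graph is
supplied as edges in both directions.

```python
def edge_centric_components(num_vertices, edges, num_partitions=2):
    """Connected components by streaming the edge list until no label changes."""
    label = list(range(num_vertices))
    changed = True
    while changed:
        # Scatter: one sequential pass over the edges, emitting an update per edge.
        updates = [(dst, label[src]) for src, dst in edges]
        # Shuffle: group the updates by the partition that owns the destination.
        buckets = [[] for _ in range(num_partitions)]
        for dst, value in updates:
            buckets[dst % num_partitions].append((dst, value))
        # Gather: stream each bucket and apply the update at the destination vertex.
        changed = False
        for bucket in buckets:
            for dst, value in bucket:
                if value < label[dst]:
                    label[dst] = value
                    changed = True
    return label


edges = [(0, 1), (1, 0), (1, 2), (2, 1), (3, 4), (4, 3)]
print(edge_centric_components(5, edges))
```

Every phase touches the edge or update list strictly in order, so the access pattern stays
sequential even though the set of vertices being updated is scattered.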

FlashGraph. While GraphChi and X-Stream are designed for general external storage, the
FlashGraph framework is developed for graphs stored on any fast I/O device, such as an array
of SSDs. FlashGraph is deployed on top of the set-associative file system (SAFS) [156], which
includes a scalable lightweight page cache, and implements a custom asynchronous user-task I/O
interface that reduces overhead for asynchronous I/O. FlashGraph employs asynchronous
message-passing and vertex-centric programming with the semi-external memory (SEM) model [93],
where vertices and algorithmic state reside in RAM, but edges are stored externally. In
experiments against GraphChi and X-Stream, FlashGraph outperformed both by orders of magnitude
even when the data for GraphChi and X-Stream was placed into RAM-disk [156].

Framework   Storage Medium  Data Layout                           Reference
GraphChi    Disk/SSD        Parallel Sliding Window               [69]
X-Stream    Disk/SSD        Streaming Partitions                  [Ml
FlashGraph  SSD Array       Semi-External Memory with Page Cache  [156]
PathGraph   Disk/SSD        Compressed DFS Traversal Trees        [148]

TABLE III: Single Machine Frameworks

PathGraph. In addition to the path-centric programming model, further discussed in Section V,
PathGraph also implements a path-centric compact storage system that improves compactness and
locality [148]. Because most iterative graph algorithms involve path traversal, PathGraph
stores edge traversal trees in depth-first search order. The forward and reverse edge trees
are each stored in a chunk storage structure that compresses data structure information
including the adjacency set, vertex IDs, and the indexing of the chunk. The efficient
computational model and storage structure of PathGraph resulted in improved graph loading
time, lower memory footprint, and faster runtime for certain algorithms when compared to
GraphChi and X-Stream.


V. ALTERNATIVE GRAPH GRANULARITY 


The strengths of the vertex-centric programming model are also its weaknesses. Whereas vertex
programs may be relatively simpler to reason about since only local data is available, the
algorithms are less expressive than conventional centralized algorithms. While TLAV frameworks
exhibit better scalability, execution can be slow because of high overhead from
synchronization and message traffic that takes orders of magnitude longer than computation.
Several frameworks strive for the best of both worlds by adopting a scope that is greater than
a vertex but less than the graph, summarized in Table IV.


A. Subgraph-centric Frameworks 

Considering the challenges addressed by TLAV frameworks, taking a subgraph-centric approach is
sensible. Conventional graph algorithms require the entire graph in memory, which is not
possible with graphs of scale. The graph, though, can be partitioned into subgraphs small
enough to fit into memory (considering computation), while the connections between subgraphs
would be no more, and likely far fewer, than the total number of edges. The system would
better utilize processing while retaining scalability.

The subgraph-centric programming model is implemented in varying degrees by several
frameworks. The Giraph++ [126], Blogel [145], and GoFFish [119] frameworks provide a
subgraph-centric interface for programming sequential algorithms. Both Giraph++ and Blogel
provide a subgraph-centric interface in addition to a vertex-centric interface. The results of
the sequential programs can then be shared either through vertex programs on boundary nodes,
or in the case of Blogel, results can be shared directly between subgraphs. GoFFish
exclusively offers a subgraph-centric interface, and implements messaging between subgraphs
and also from subgraphs to specific vertices, the latter being used for traversal algorithms.
By allowing subgraphs to directly message vertices, any vertex-centric algorithm can be
implemented by a subgraph-centric framework, maintaining scalability while enabling
significant performance improvement. Collectively, subgraph-centric frameworks dramatically
outperform TLAV frameworks, often by orders of magnitude in terms of computing time, number of
messages, and total supersteps [126, 145].
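
The reduction in supersteps can be illustrated with connected components. The sketch below is
not the interface of Giraph++, Blogel, or GoFFish: the function name and data structures are
hypothetical, the sequential algorithm inside each subgraph is a simple minimum-label
fixpoint, and messaging is modeled as one exchange over the cut edges per superstep.

```python
def subgraph_centric_components(partitions, edges):
    """Connected components with a sequential pass per subgraph and cut-edge messaging.

    partitions: list of vertex lists, one subgraph per worker
    edges:      undirected edges as (u, v) pairs
    Returns ({vertex: component label}, supersteps).
    """
    owner = {v: i for i, part in enumerate(partitions) for v in part}
    label = {v: v for v in owner}
    local = [[] for _ in partitions]
    cut = []
    for u, v in edges:
        (local[owner[u]] if owner[u] == owner[v] else cut).append((u, v))

    def sequential_pass(i):
        # Runs to a fixpoint inside subgraph i without any network traffic.
        changed = True
        while changed:
            changed = False
            for u, v in local[i]:
                m = min(label[u], label[v])
                if label[u] != m or label[v] != m:
                    label[u] = label[v] = m
                    changed = True

    supersteps = 0
    while True:
        for i in range(len(partitions)):
            sequential_pass(i)
        # Only boundary vertices exchange labels across subgraphs.
        messages = 0
        for u, v in cut:
            m = min(label[u], label[v])
            if label[u] != m or label[v] != m:
                label[u] = label[v] = m
                messages += 1
        supersteps += 1
        if messages == 0:
            break
    return label, supersteps


path = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(subgraph_centric_components([[0, 1, 2], [3, 4, 5]], path))
```

On this six-vertex path the computation finishes in two supersteps with a single cross-worker
message, whereas a vertex-centric minimum-label propagation would need a number of supersteps
proportional to the path length.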

The GraphHP [23] and P++ [157] frameworks do not implement an
interface for sequential programs, but do distinguish boundary
(inter-partition) nodes from internal nodes to improve performance. In
these two frameworks, supersteps are split into two phases: in the
first phase, messages are exchanged between vertices on partition
boundaries, and in the second phase, vertices within a partition
repeatedly execute the vertex program to completion, exchanging
messages in memory. This method reduces communication and improves
performance; however, iteratively executing intra-worker vertex
programs is less efficient than executing a sequential algorithm.
Message-passing algorithms are typically more scalable than sequential
graph algorithms, but P++ is not distributed, nor is Block-based GRACE
[143], an extension of urn, although the latter demonstrates that
executing vertex updates on a subgraph block basis improves locality
and cache hits while reducing memory access time, which is a
bottleneck for computationally light algorithms like PageRank.
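
The two-phase superstep described above can be illustrated with a small single-process sketch. It is not the GraphHP or P++ implementation: the function names, the min-label (connected components) vertex program, and the dictionary-based graph and partition map are all assumptions chosen for illustration. The sketch contrasts a plain vertex-centric run, where a label moves one hop per superstep, with a hybrid run in which only boundary exchanges count as global supersteps.

```python
def bsp_min_label(adj):
    """Plain vertex-centric min-label propagation: one hop per superstep."""
    label = {v: v for v in adj}
    supersteps = 0
    while True:
        snapshot = dict(label)  # labels as sent in the previous superstep
        changed = False
        for v in adj:
            best = min((snapshot[u] for u in adj[v]), default=label[v])
            if best < label[v]:
                label[v] = best
                changed = True
        supersteps += 1
        if not changed:
            return label, supersteps


def two_phase_min_label(adj, part):
    """Hybrid variant (illustrative): boundary exchange, then local convergence."""
    label = {v: v for v in adj}
    supersteps = 0
    while True:
        changed = False
        # Phase 1: labels cross partition boundaries (the only "network" traffic).
        snapshot = dict(label)
        for v in adj:
            for u in adj[v]:
                if part[u] != part[v] and snapshot[u] < label[v]:
                    label[v] = snapshot[u]
                    changed = True
        # Phase 2: each partition reruns the vertex program in memory
        # until no label inside the partition changes.
        local_change = True
        while local_change:
            local_change = False
            for v in adj:
                for u in adj[v]:
                    if part[u] == part[v] and label[u] < label[v]:
                        label[v] = label[u]
                        local_change = changed = True
        supersteps += 1
        if not changed:
            return label, supersteps


# Hypothetical input: an 8-vertex path split into two partitions of 4 vertices.
path = {v: [u for u in (v - 1, v + 1) if 0 <= u < 8] for v in range(8)}
partition = {v: 0 if v < 4 else 1 for v in path}
bsp_labels, bsp_steps = bsp_min_label(path)
hybrid_labels, hybrid_steps = two_phase_min_label(path, partition)
print(bsp_steps, hybrid_steps)
```

Both runs produce the same labels, but the hybrid run needs fewer global supersteps because propagation inside a partition does not wait for a synchronization barrier.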

TLAV frameworks illustrate the principal ideas for scalable graph
processing, but for the best performance, users may consider
subgraph-centric frameworks. Subgraph frameworks leverage principles
of TLAV frameworks to execute sequential graph algorithms in a
distributed environment. The Giraph++, Blogel, and GoFFish frameworks
reduce the scope of sequential graph algorithms for the subgraph to
fit in memory while utilizing vertex or subgraph messaging to maintain
scalability. Together, the vertex-centric and subgraph-centric
programming models, compared to sequential graph algorithms,
demonstrate how scalability varies inversely with scope.


B. Other Scopes: Paths and Sets 

While subgraph-centric frameworks illustrate the scope/scalability
trade-off, several other frameworks adopt alternative computational
scopes that demonstrate additional benefits.

A more specific type of subgraph, a traversal tree, is used for the
programming model in PathGraph [148]. Traversals are a fundamental
component of many graph algorithms, including PageRank and
Bellman-Ford shortest path. PathGraph first partitions the graph into
paths, with each partition represented as two trees, a forward and a
reverse edge traversal. Then, for the path-centric computational
model, path-centric scatter and path-centric gather functions are
available to the user to define an algorithm that traverses each tree.
The user also defines a vertex update function, which is executed by
the path-centric functions during the traversal. Like block-based
GRACE, the path-centric model utilizes locality to improve performance
through reduced memory usage and efficient caching. PathGraph also
implements a path-centric storage model that enables the framework to
process billion-node graphs on a single machine (see Section IV-D)
[148].

Graph processing frameworks designed for single machines can implement
interfaces of unique granularity. A vertex subset interface is
implemented in Ligra [117]. Ligra argues that high-end servers provide
enough memory for large-scale graphs, and thus implements a
vertex-centric programming interface while retaining a global view of
the graph. Inspired by a hybrid breadth-first search (BFS) algorithm
EL, Ligra dynamically switches between sparse and dense
representations of edge sets depending on the size of the vertex
subset, which impacts whether push or pull operations are performed
with the vertex subset. Polymer GlQ adopts an interface similar to
Ligra's, but with several NUMA-aware optimizations. Galois [68] is a
shared-memory framework that executes user-defined set operators while
exploiting amorphous data parallelism [95]. Galois can implement a
variety of programming interfaces, including the vertex-centric
paradigm [86].
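
The push/pull switching described for Ligra can be sketched as a direction-switching BFS. This is a minimal illustration and not Ligra's API: the function name `hybrid_bfs`, the `dense_fraction` parameter and its default of 1/20, and the dictionary-based undirected graph are assumptions. A small frontier is processed sparsely (frontier vertices push to their neighbors); a large frontier is processed densely (every unvisited vertex pulls from its neighbors).

```python
def hybrid_bfs(adj, source, dense_fraction=1 / 20):
    """BFS over an undirected graph that switches between push and pull.

    `dense_fraction` is an assumed tuning knob: the frontier is treated as
    dense when its vertices plus their edges exceed that share of all edges.
    """
    total_edges = sum(len(neighbors) for neighbors in adj.values())
    parent = {source: source}
    frontier = {source}
    modes = []  # records which direction each round used
    while frontier:
        frontier_size = len(frontier) + sum(len(adj[v]) for v in frontier)
        next_frontier = set()
        if frontier_size > dense_fraction * total_edges:
            # Dense round (pull): unvisited vertices look for a frontier neighbor.
            modes.append("pull")
            for v in adj:
                if v in parent:
                    continue
                for u in adj[v]:
                    if u in frontier:
                        parent[v] = u
                        next_frontier.add(v)
                        break  # one parent is enough; skip remaining edges
        else:
            # Sparse round (push): frontier vertices claim unvisited neighbors.
            modes.append("push")
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        next_frontier.add(v)
        frontier = next_frontier
    return parent, modes


# Hypothetical input: a short chain 0-1-2 feeding a hub (vertex 2) with 5 leaves.
edges = [(0, 1), (1, 2)] + [(2, v) for v in range(3, 8)]
graph = {v: [] for v in range(8)}
for a, b in edges:
    graph[a].append(b)
    graph[b].append(a)

parents, modes = hybrid_bfs(graph, 0, dense_fraction=0.3)
print(modes)
```

The early exit in the pull branch is where the dense representation pays off: once most vertices are in the frontier, an unvisited vertex usually finds a parent after inspecting only a few of its edges.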


C. Optimizations 

Two optimizations have been introduced in [110] for TLAV frameworks
that improve performance by adopting a scope of the graph other than
vertex-centric. The Finishing Computation Serially (FCS) method is
applicable when an algorithm with a shrinking set of active vertices
converges slowly near the end of execution [110]. The FCS method is
triggered when the remaining active graph can fit in the memory of a
single machine; in these instances, the active portions are sent to
the master and completed serially from a global, shared-memory
perspective of the graph.
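
The FCS trigger can be sketched on a toy algorithm whose active set shrinks slowly. The example below is an assumption-laden illustration, not the method's published implementation: it uses a deterministic maximal independent set (a vertex joins when its id is smaller than every active neighbor's id), models "fits in the memory of a single machine" as a budget on active vertices plus active edges, and the names `mis_with_fcs`, `serial_greedy_mis`, and `fcs_budget` are hypothetical.

```python
def serial_greedy_mis(adj, active):
    """Serial finish on the 'master': greedy MIS over the active subgraph."""
    chosen = set()
    for v in sorted(active):
        if not any(u in chosen for u in adj[v]):
            chosen.add(v)
    return chosen


def mis_with_fcs(adj, fcs_budget=0):
    """Vertex-centric MIS that hands off to a serial finish when small enough.

    `fcs_budget` stands in for single-machine memory, measured here as
    active vertices + active edges. A budget of 0 disables the hand-off.
    """
    active = set(adj)
    mis = set()
    supersteps = 0
    while active:
        active_edges = sum(1 for v in active for u in adj[v] if u in active) // 2
        if len(active) + active_edges <= fcs_budget:
            # FCS: ship the remaining active subgraph and finish serially.
            mis |= serial_greedy_mis(adj, active)
            break
        # One superstep: local minima join the set, then they and their
        # neighbors become inactive.
        winners = {v for v in active
                   if all(v < u for u in adj[v] if u in active)}
        removed = set(winners)
        for v in winners:
            removed.update(u for u in adj[v] if u in active)
        mis |= winners
        active -= removed
        supersteps += 1
    return mis, supersteps


# Hypothetical worst case: on a path with increasing ids only one vertex
# joins per superstep, so the tail of the computation is slow.
path = {v: [u for u in (v - 1, v + 1) if 0 <= u < 10] for v in range(10)}
plain_mis, plain_steps = mis_with_fcs(path)
fcs_mis, fcs_steps = mis_with_fcs(path, fcs_budget=8)
print(plain_steps, fcs_steps)
```

Both runs return the same set, but the FCS run replaces the slow final supersteps with a single serial pass over the small remaining subgraph.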

Similarly, the Single Pivot (SP) optimization [110], first presented
in E5D, also temporarily adopts a global view. For algorithms that
execute breadth-first search (BFS) across all vertices, e.g., the
connected components algorithm, executing BFS from every node incurs a
high messaging cost. Instead, SP randomly selects one vertex from the
graph and performs BFS just from that vertex. Since most graphs have
one big component in addition to many small ones, the BFS from a
random node can be executed until the big component is found; BFS is
then executed from every vertex outside the big component to complete
the algorithm, resulting in significantly fewer total messages. This
optimization adjusts scope by randomly selecting a single vertex
through a global aggregator [80], which also adopts a scope beyond the
vertex.
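
A rough sketch of the Single Pivot idea for connected components follows. It is a single-process illustration under stated assumptions, not the published implementation: `hashmin` stands in for the "BFS from every vertex" baseline as min-label propagation, messages are counted as one per edge traversal, and the pivot's component is labeled with its smallest id purely so the result can be compared with the baseline. All function names are hypothetical.

```python
import random


def hashmin(adj, vertices):
    """Min-label propagation on `vertices` (must be a union of components)."""
    label = {v: v for v in vertices}
    active = set(vertices)
    messages = 0
    while active:
        inbox = {}
        for v in active:
            for u in adj[v]:
                messages += 1
                inbox[u] = min(inbox.get(u, label[v]), label[v])
        # Only vertices whose label improved stay active and send again.
        active = {u for u, best in inbox.items() if best < label[u]}
        for u in active:
            label[u] = inbox[u]
    return label, messages


def bfs_component(adj, pivot):
    """BFS from the pivot; each visited vertex messages its neighbors once."""
    seen = {pivot}
    frontier = [pivot]
    messages = 0
    while frontier:
        next_frontier = []
        for v in frontier:
            for u in adj[v]:
                messages += 1
                if u not in seen:
                    seen.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return seen, messages


def single_pivot_cc(adj, pivot):
    """BFS from one pivot, then run the baseline only on what is left."""
    component, bfs_messages = bfs_component(adj, pivot)
    rest = set(adj) - component
    label, rest_messages = hashmin(adj, rest)
    component_label = min(component)  # convenience so labels match hashmin
    label.update({v: component_label for v in component})
    return label, bfs_messages + rest_messages


# Hypothetical graph: one big component (a 20-vertex path) plus two small ones.
adj = {v: [] for v in range(23)}
for a, b in [(i, i + 1) for i in range(19)] + [(20, 21)]:
    adj[a].append(b)
    adj[b].append(a)

plain_labels, plain_messages = hashmin(adj, set(adj))
pivot = random.Random(7).choice(sorted(adj))
sp_labels, sp_messages = single_pivot_cc(adj, pivot)
print(plain_messages, sp_messages)
```

In this counting model the pivot run never sends more messages than the baseline, and the saving is largest when the pivot lands in the big component, which is the likely case when one component dominates the graph.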


VI. RELATED WORK 

In this paper, vertex-centric graph processing systems for large-scale
graphs are surveyed. In previous related work, Pregel and GraphLab
have been compared [108], general graph processing systems have been
surveyed ||63l 1551 , and four TLAV frameworks have been empirically
evaluated on four algorithms [49]. A tutorial on TLAV frameworks was
recently delivered at an international conference El.
Framework       Programming Model   Sequential   Vertex      Distributed   Ref.
                                    Algorithms   Messaging
Giraph++        Subgraph            Y            Y           Y             csa
Blogel          Subgraph            Y            Y           Y             [145]
GoFFish         Subgraph            Y            N           Y             [119]
GraphHP         Subgraph            N            Y           Y             [23]
P++             Subgraph            N            Y           N             [157]
GRACE (block)   Subgraph            N            Y           N             [143]
PathGraph       Path                N            Y           Y             [148]
Ligra           Vertex Subset       Y            Y           N             [117]
Polymer         Vertex Subset       Y            Y           N             cm
Galois          User-Defined Set    Y            Y           N             [68]

TABLE IV: Frameworks of Alternative Scope


TLAV frameworks intersect several subjects, including graph
processing, distributed computing, Big Data, and distributed
algorithms. Several graph processing frameworks have been recently
developed outside of the vertex-centric programming model. PEGASUS
combines the BSP model with generalized matrix-vector multiplication
(GIM-V) [58], while TurboGraph introduces the pin-and-slide model to
perform GIM-V on a single machine [50]. Combinatorial BLAS OS and the
Parallel Boost Graph Library P31 are software libraries for
high-performing parallel computation of sequential programs. Piccolo
performs distributed graph computation using distributed tables [97].

Graph databases, such as Neo4j 033, HyperGraphDB [55], and GBASE [58],
are decidedly different from TLAV frameworks. Both treat vertices as
first-class citizens, and both face related problems like
partitioning, but the key distinction is that databases focus on
transactional processing while TLAV frameworks focus on batch
processing GZ). Databases offer local or online queries, such as 1-hop
neighbors, whereas TLAV systems iteratively process the entire graph
offline in batch. Some more general graph management systems, like
Trinity [115] and Grace [98], offer suites of features that include
both vertex-centric processing and queries. Sensibly, a graph
processing engine may be developed on top of a graph database.
However, the two should not be confused, and their performance is not
comparable.

A closely related Big Data framework is MapReduce [32], [96].
MapReduce is a different programming model from TLAV frameworks, but
similarly enables large-scale computation and, when implemented,
abstracts away the details of distributed programming. The programming
model is effective for many types of computation, but addresses
neither iterative processing nor graph processing [80], [96].
Iterative computation is not natively supported, as the programming
model performs only a single pass over the data with no loop
awareness. Moreover, input and output are read from and written to a
distributed filesystem, e.g., HDFS, on every pass, rendering iterative
computation inefficient [96]. Nonetheless, several frameworks have
extended MapReduce to support iterative computation [13], [37], [53],
but such frameworks are still agnostic to the challenges of graph
processing. Graph computation with MapReduce has been explored Eii,
but is generally acknowledged to be lacking GOES. A comparison of
MapReduce and BSP is provided in ll57l . Still, some argue that
MapReduce should remain the sole “hammer” for Big Data analytics
because of its widespread adoption throughout industry [72].

Similarly, in response to TLAV shortcomings, such as poor out-of-core
support and lengthy loading times, some frameworks rework pre-existing
graph database technologies to provide a vertex-centric interface
[38]. However, many of these projects lose sight of the main problems
addressed by vertex-centric processing. TLAV frameworks are ultimately
Big Data solutions, designed for large graphs to be leveraged against
the memory and processing power of several machines, not single
machines. Moreover, TLAV frameworks iteratively process the entire
graph, and do not provide graph queries like 1-hop or 2-hop neighbors.
TLAV frameworks are not a universal solution for graph analytics, but
rather provide an approach for scalable, iterative graph processing.

Temporal graph processing is beyond the scope of this survey, though a
small number of TLAV frameworks have been developed for temporal
analysis USE). These frameworks compute temporal properties offline in
batch through graph snapshots, necessitating multiple framework
components, including a front-end ingress component, an analytics
engine, and a storage component such as a graph database. Temporal
graph layout optimizations were introduced in Chronos ED. These
frameworks illustrate how advanced graph analytics systems utilize the
strengths of different graph technologies for different components,
e.g., graph databases for storage and online queries, and
vertex-centric computation for batch analytics. Dynamic graph
algorithms and general analytics systems have also been surveyed [2],
[132]. Dynamic graphs are supported by many frameworks including
Pregel, but the topic was omitted from this survey due to widely
varying support by the frameworks and the broad scope of the topic.

While coined “vertex-centric” relative to conventional graph
processing approaches, the algorithms executed by TLAV frameworks are
more formally known as distributed algorithms. Distributed algorithms
is a mature field of study [79], and further examples beyond Figure 3
may be found within the referenced frameworks. Some works have
explored distributed algorithms within the context of TLAV frameworks
[146], but researchers and practitioners should be aware that TLAV
frameworks execute distributed algorithms [79], which come from a
field with a considerable body of work, including theory and analysis.
The theoretical limits of what can be computed with vertex-centric
frameworks, specifically with the synchronous, message-passing LOCAL
model, have been studied (571 .

This paper surveys and compares the various components of TLAV
frameworks, which are a platform for executing vertex-centric
algorithms. Like MapReduce, these frameworks provide an interface for
a user-defined function, while abstracting away the lower-level
details of cluster computing. Changing the components of the framework
will impact system performance and run-time characteristics, but will
generally not impact the design or result of the algorithm [?].


VII. CONCLUSIONS 

TLAV frameworks have been designed in response to the challenges of
processing large graphs. Primary challenges include the unstructured
nature of graphs, where an edge may span any two vertices, so the
entire graph must be randomly accessible for conventional processing.
TLAV frameworks are also developed for ease of use, providing a simple
vertex-centric interface while abstracting away the lower-level
details of cluster computing. MapReduce similarly enables highly
scalable computing, but is ill-suited for iterative graph processing.

By adopting a vertex-centric programming model, the scope of
computation is dramatically reduced. To perform an update, each vertex
only needs data from immediate neighbors. Data residing on a separate
machine can be acquired directly between workers, avoiding the
bottleneck of central coordination and enabling excellent scalability.
The four pillars of the vertex-centric programming model, (i) timing,
(ii) communication, (iii) the execution model, and (iv) partitioning,
were presented and surveyed in the context of distributed graph
processing frameworks. However, vertex-centric algorithms, more
formally known as distributed algorithms, have an established history
and are still actively researched EZJED.

Several related frameworks were explored that similarly adopt a
computational scope of the graph at varying granularity. These
frameworks of alternative scope are like a Goldilocks solution to
graph processing. Centralized algorithms with the entire graph in
scope require too much memory; vertex-centric algorithms can scale but
are less expressive and require many relatively slow messages;
subgraph-centric algorithms can utilize the two resources just right.
A significant contribution of TLAV frameworks is exposing how, for
graphs, reducing the scope of a program increases scalability.

Of course, expressing a particular algorithm as subgraph-centric is
not trivial. The future of practical large-scale distributed graph
processing may be related to finding algorithms that process a graph
as independent subgraphs, such as divide-and-conquer, or algorithms
that can process graphs at multiple, or even dynamic, scopes [136].
The performance of subgraph-centric processing is also closely tied to
the effectiveness of large-scale graph partitioning, including
streaming and distributed partitioning techniques.

TLAV frameworks are a tool for graph processing at scale. Not all
graphs are large enough to necessitate distributed processing, and not
all graph problems need the whole graph to be computed iteratively.
Moreover, there is often more than one way to solve a problem, but
these frameworks are simple to program, easy to distribute, and are
not a bad choice for the right type of problem. Subgraph-centric
frameworks take vertex-centric frameworks a step further for
performance. Datasets will continue to grow dramatically into the new
age of Big Data, and designers of processing systems should begin
asking whether their systems can scale out infinitely. TLAV frameworks
illustrate how conventional centralized systems will fail in the Big
Data ecosystem, and how decentralized platforms must be embraced.